Preface



“How often we recall, with regret”, wrote Mark Twain about editors, “that 
Napoleon once shot at a magazine editor and missed him and killed a publisher. 
But we remember with charity, that his intentions were good.” Fortunately, we 
live in more forgiving times, and are openly able to express our pleasure at being 
the editors of this volume containing the papers selected for presentation at the 
14th International Conference on Inductive Logic Programming.

ILP 2004 was held in Porto from the 6th to the 8th of September, under
the auspices of the Department of Electrical Engineering and Computing of the 
Faculty of Engineering of the University of Porto (FEUP), and the Laboratório
de Inteligência Artificial e Ciências da Computação (LIACC). This annual meeting
of ILP practitioners and curious outsiders is intended to act as the premier
forum for presenting the most recent and exciting work in the field. Six invited 
talks — three from fields outside ILP, but nevertheless highly relevant to it — and 
20 full presentations formed the nucleus of the conference. It is the full-length 
papers of these 20 presentations that comprise the bulk of this volume. As is now 
common with the ILP conference, presentations made to a “Work-in-Progress” 
track will, hopefully, be available elsewhere. 

We gratefully acknowledge the continued support of Kluwer Academic Publishers
for the “Best Student Paper” award on behalf of the Machine Learning
journal; and Springer-Verlag for continuing to publish the proceedings of
these conferences. The Fundação para a Ciência e a Tecnologia, Fundação
Luso-Americana para o Desenvolvimento, Fundação Oriente, Departamento de
Engenharia Electrotécnica e de Computadores, and KDNet, the European Knowledge
Discovery Network of Excellence, have all been extremely generous, and we are
thankful. Special mention too must be made of João Correia Lopes, who
orchestrated the electronic components of the conference most beautifully.



Automated Synthesis of Data Analysis Programs: Learning in Logic



Wray Buntine 



Complex Systems Computation Group 
Helsinki Institute for Information Technology 
P.O. Box 9800, FIN-02015 HUT, Finland 
Wray.Buntine@HIIT.FI



Program synthesis is the systematic, usually automatic construction of correct 
and efficient executable code from declarative statements. Program synthesis is 
routinely used in industry to generate GUIs and for database support. 

I contend that program synthesis can be applied as a rapid prototyping 
method to the data mining phase of knowledge discovery. Rapid prototyping 
of statistical data analysis algorithms would allow experienced analysts to ex- 
periment with different statistical models before choosing one, but without re- 
quiring prohibitively expensive programming efforts. It would also smooth the 
steep learning curve often faced by novice users of data mining tools and li- 
braries. Finally, it would accelerate dissemination of essential research results. 
For the synthesis task, development on such a system has used a specification 
language that generalizes Bayesian networks, a dependency model on variables. 
With decomposition methods and algorithm templates, the system transforms 
the network through several levels of representation into pseudo-code which can 
be translated into the implementation language of choice. The system applies 
computational logic to make learning work. 

In this talk, I will present the AutoBayes system developed through a long 
program of research and development primarily by Bernd Fischer, Johann Schu- 
mann and others [1,2] at NASA Ames Research Center, starting from a program 
of research by Wray Buntine [3] and Mike Lowry. I will explain the framework 
on a mixture of Gaussians model used in some commercial clustering tools, and 
present some more realistic examples. 



References 

1. Bernd Fischer and Johann Schumann. AutoBayes: a system for generating data
analysis programs from statistical models. J. Funct. Program., 13(3):483-508, 2003.

2. Bernd Fischer and Johann Schumann. Applying AutoBayes to the analysis of
planetary nebulae images. In ASE 2003, pages 337-342, 2003.

3. W. Buntine, B. Fischer, and T. Pressburger. Towards automated synthesis of data
mining programs. In Proceedings of the Fifth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 372-376. ACM Press, 1999.



At the Interface of Inductive Logic Programming and Statistics



James Cussens 



Department of Computer Science 
University of York 

Heslington, York, YO10 5DD, United Kingdom 
http://www-users.cs.york.ac.uk/~jc/



Inductive logic programming can be viewed as a style of statistical inference 
where the model that is inferred to explain the observed data happens to be 
a logic program. In general, logic programs have important differences to other 
models (such as linear models, tree-based models, etc) found in the statistical 
literature. This is why we have ILP conferences!

However, the burden of this talk is that there is much to be gained by sit- 
uating ILP inside the general problem of statistical inference. I will argue that 
this can be most readily achieved within a Bayesian framework. Compared to 
other models a striking characteristic of logic programs is their non-probabilistic 
nature: a query either fails, succeeds (possibly instantiating output variables) or 
does not terminate. Defining a particular probability distribution over possible 
outputs — the hallmark of a statistical model — is not easy to implement with 
‘vanilla’ logic programs. 

Recently, there has been a surge of interest in addressing this lacuna, with
a number of formalisms proposed (and developed) which explicitly incorporate
probability distributions within a logic programming framework. Bayesian
logic programs (BLPs), stochastic logic programs (SLPs), PRISM programs and
CLP(BN) programs are just four such proposals. These logic-based developments
are contemporaneous with the growth of “Statistical Relational Learning”
(SRL). In SRL the basic goal is to develop learning techniques for data not
composed of a set of independent and identically distributed (iid) datapoints 
sitting in a single data table. In other words there is some relationship between 
the data; or, equivalently, there is some structure in the data which it would be 
misleading to ignore. Existing SRL models (PRMs are probably the best-known) 
are not always logical — it remains to be seen how influential statistical ILP will 
be on this area. 

Interestingly, there are related developments emanating from the statistical 
community. Benefitting from more powerful computers and theoretical advances 
concerning conditional independence and Bayesian methods, statisticians can 
now model Highly Structured Stochastic Systems (HSSS); this is the title of a
recent book (and European project) in this area. The ILP community has been 
dealing with “highly structured” learning problems for well over a decade now,
so this is potentially an area to which statistical ILP can contribute (and benefit
from).

My own efforts (together with Nicos Angelopoulos) at the intersection of 
logic programming and statistics have centred on combining SLPs with a Markov 
chain Monte Carlo (MCMC) algorithm (the Metropolis-Hastings algorithm, in 
fact) to effect Bayesian inference. We use SLPs to define prior distributions, so 
given a non-SLP prior it would be nice to be able to automatically construct 
an equivalent SLP prior. Since the structure of an SLP is nothing other than a 
logic program this boils down to an ILP problem. So it is not only that ILP can
benefit from statistical thinking; statistics can sometimes benefit from ILP too.




From Promising to Profitable Applications of 
ILP: A Case Study in Drug Discovery 



Luc Dehaspe 



PharmaDM and Department of Computer Science 
Katholieke Universiteit Leuven, Belgium 
http://www.cs.kuleuven.ac.be/~ldh/



PharmaDM was founded at the end of 2000 as a spin-off from three European universities
(Oxford, Aberystwyth, and Leuven) that participated in two successive EC
projects on Inductive Logic Programming (ILP I-II, 1992-1998). Amongst the
projects’ highlights was a series of publications that demonstrated the added
value of ILP in applications related to the drug discovery process. The mission
of PharmaDM is to build on those promising results, including software modules 
developed at the founding universities (i.e., Aleph, Tilde, Warmr, ILProlog), and 
develop a profitable ILP-based data mining product customised to the needs of
drug discovery researchers. Technology development at PharmaDM is mostly 
based on “demand pull”, i.e., driven by user requirements. In this presentation 
I will look at the way ILP technology at PharmaDM has evolved over the past 
four years and the user feedback that has stimulated this evolution. 

In the first part of the presentation I will start from the general technology 
needs in the drug discovery industry and zoom in on the data analysis require- 
ments of some categories of drug discovery researchers. One of the conclusions 
will be that ILP — via its ability to handle background knowledge and link multi- 
ple data sources — offers fundamental solutions to central data analysis problems 
in drug discovery, but is only perceived by the user as a solution after it has
been complemented with (and hidden behind) more mundane technologies. 

In the second part of the presentation I will discuss some research topics that 
we encountered in the zone between promising prototype and profitable product. 
I will use those examples to argue that ILP research would benefit from very 
close collaborations, in a “demand-pull” rather than “technology push” mode, 
with drug discovery researchers. This will however require an initial investment 
of the ILP team to address the immediate software needs of the user, which are 
often not related to ILP.



Systems Biology: A New Challenge for ILP



Steve Oliver 



School of Biological Sciences 
University of Manchester 
United Kingdom 



The generation and testing of hypotheses is widely considered to be the primary 
method by which Science progresses. So much so, that it is still common to 
find a scientific proposal or an intellectual argument damned on the grounds 
that “it has no hypothesis being tested”, “it is merely a fishing expedition”, 
and so on. Extreme versions run “if there is no hypothesis, it is not Science”, 
the clear implication being that hypothesis-driven programmes (as opposed to 
data-driven studies) are the only contributor to the scientific endeavour. This 
misrepresents how knowledge and understanding are actually generated from 
the study of natural phenomena and laboratory experiments. Hypothesis-driven 
and inductive modes of reasoning are not competitive, but complementary, and 
both are required in post-genomic biology. 

Thus, post-genomic biology aims to reverse the reductionist trend that has 
dominated the life sciences for the last 50 years, and adopt a more holistic or 
integrative approach to the study of cells and organisms. Systems Biology is 
central to the post-genomic agenda and there are plans to construct complete 
mathematical models of unicellular organisms, with talk of the ‘virtual E. coli’,
the ‘in silico yeast’, etc. In truth, such grand syntheses are a long way off,
not least because much of the quantitative data that will be required, if such 
models are to have predictive value and explanatory power, simply does not 
exist. Therefore, we will have to approach such comprehensive models in an 
incremental fashion, first constructing models of smaller sub-systems (e.g. energy 
generation, cell division etc.) and then integrating these component modules into 
a single construct, representing the entire cell. 

The problem, then, is to ensure that the modules can be joined up in a 
seamless manner to make a complete working model of a living cell that makes 
experimentally testable predictions and can be used to explain empirical data. 
In other words, we do not want to be in a situation, in five or ten years’
time, where we attempt to join all the sub-system models together, only to 
find that we ‘can’t get there from here’. Preventing such a debacle is partly a 
mechanical problem — we must ensure that the sub-system models are encoded in 
a truly modular fashion and that the individual modules are fully interoperable. 
However, we need something beyond these operational precautions: we require 
an overarching framework within which the models for the different sub-systems 
may be constructed. There is a general awareness of this problem and there is 
much debate about the relative merits of ‘bottom-up’ and ‘top-down’ approaches
in Systems Biology. In fact, both will be needed — but the foregoing discussion 
demonstrates that the ‘top-down’ approach faces the larger conceptual problems. 

It is difficult to construct an overarching framework for a model of, say, a 
yeast cell when one has no idea what the final model will look like. There are 
two kinds of solution to this problem. One is to build a structure that is simply 
a data model, where different kinds of genomic, functional genomic, genetic, and 
phenotypic data can be stored. We have already done this (in part) for yeast and 
have constructed a data warehouse containing all of these data types. However, 
although the schema for this warehouse looks a bit like a model of a yeast 
cell, this is an illusion — there is no dynamics and the structure is a hierarchical 
one, which (while convenient) is far too simplistic a view of the cell. While it 
would be possible to attach the Systems Biology Modules to the objects in this 
schema, this would be a cumbersome device, and not sufficiently integrative or 
realistic. The second kind of solution is to build a very coarse-grained model 
of the yeast cell based on our current knowledge. This is dangerous since our 
current knowledge is very incomplete, with much relevant data being unavailable 
at present. Such a construct would very likely lead to us being in a ‘can’t get 
there from here’ situation a few years down the road. 

A coarse-grained model is certainly desirable, but it would be best to get the
yeast cell to construct it for us, rather than make an imperfect attempt ourselves. 
How might this be achieved? First, we need a general mathematical framework 
in which to build the coarse-grained model. We have chosen to use the formalism 
of Metabolic Control Analysis, which was developed in part as a shorthand way 
of modelling biochemical genetic systems and metabolism, but is more widely 
applicable since, in effect, it represents a sensitivity analysis of the degree of 
control that different components of a system have over the system as a whole. 
As such, it seems eminently suitable for our purposes. What we now need to do 
is to identify those components of the system that exert the greatest degree of 
control over the pathways in which they participate (or which they regulate). 
In this initial model, we will not concern ourselves with the complications of 
the yeast life cycle (i.e. sex!), but will confine our coarse-grained model to a 
representation of cell growth and mitotic division. Thus, we need to identify 
those components of a yeast cell that exert the greatest degree of control over 
its rate of growth and division. In the parlance of Metabolic Control Analysis, 
these components would be said to have high Flux-Control Coefficients. I will 
describe the sorts of experiments that might be used for such an approach to 
Systems Biology, and discuss the role that ILP could play in both the design 
of these experiments and the derivation of novel insights into biology from the 
data that they produce. 
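For readers unfamiliar with the term, the usual textbook definition of a flux-control coefficient in Metabolic Control Analysis can be sketched as below. The symbols are notation assumed for this note and are not taken from the talk: J stands for a steady-state flux (here, the rate of growth and division) and e_i for the activity or concentration of component i. The second relation is the summation theorem, which is why only a few components can have a high coefficient at the same time.

```latex
C^{J}_{i} \;=\; \frac{\partial J}{\partial e_i}\,\frac{e_i}{J}
          \;=\; \frac{\partial \ln J}{\partial \ln e_i},
\qquad
\sum_{i} C^{J}_{i} \;=\; 1 .
```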




Scaling Up ILP: Experiences with Extracting 
Relations from Biomedical Text 



Jude Shavlik 

Departments of Computer Sciences and 
Biostatistics and Medical Informatics 
University of Wisconsin-Madison, USA 
http://www.cs.wisc.edu/~shavlik/



We have been applying Inductive Logic Programming (ILP) to the task of 
learning how to extract relations from biomedical text (specifically, Medline ab- 
stracts). Our primary focus has been learning to recognize instances of “this 
protein is localized in this part of the cell” from labeled training examples. ILP 
allows one to naturally make use of substantial background knowledge (e.g.,
biomedical ontologies such as the Gene Ontology, GO, and the Medical Subject
Headings, MeSH) and rich representations of the examples (e.g., parse trees).
We discuss how we formulated this task for ILP and describe our methods for 
scaling ILP to this large task. We conclude with a discussion of some of the 
major challenges that ILP needs to address in order to scale to large tasks. 

Our dataset can be found at
ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location/,
and two technical publications on our research can be found in these proceedings.
This research was supported by United States National Library of Medicine (NLM)
Grant R01 LM07050-01, DARPA Grant F30602-01-2-0571, and United States Air Force
Grant F30602-01-2-0571.



Macro-Operators Revisited in Inductive Logic Programming



Erick Alphonse 



MIG - INRA/UR1077 
F-78352 Jouy-en-Josas Cedex, France

ealphons@jouy.inra.fr



Abstract. For the last ten years a lot of work has been devoted to
propositionalization techniques in relational learning. These techniques
change the representation of relational problems to attribute-value problems
in order to use well-known learning algorithms to solve them.
Propositionalization approaches have been successfully applied to various
problems but are still considered ad hoc techniques. In this paper, we study
these techniques in the larger context of macro-operators as techniques
to improve heuristic search. The macro-operator paradigm enables us
to propose a unified view of propositionalization and to discuss its current
limitations. We show that a whole new class of approaches can be
developed in relational learning which extends the idea of changes of
representation to better-suited learning languages. As a first step, we propose
different languages that provide a better compromise than current
propositionalization techniques between the cost of building macro-operators
and the cost of learning. It is known that ILP problems can be reformulated
either into attribute-value or multi-instance problems. With the
macro-operator approach, we see that we can target a new representation
language, which we name multi-table. This new language is more expressive
than attribute-value but simpler than multi-instance. Moreover, it is
PAC-learnable under weak constraints. Finally, we suggest that relational
learning can benefit from both the problem solving and the attribute-value
learning communities by focusing on the design of effective macro-operator
approaches.



1 Introduction 

Following [1], concept learning is defined as search: given a hypothesis space defined
a priori, identified by its representation language, find a hypothesis consistent
with the learning data. That paper, by relating concept learning to search in a state
space, has enabled machine learning to integrate techniques from problem solving,
operations research and combinatorics. The search is NP-complete for a
large variety of languages of interest (e.g. [2,3,4]) and heuristic search is crucial
for efficiency. Although heuristic search has been shown to be effective in
attribute-value languages, it appeared early on that learning in relational languages,
known for more than a decade as Inductive Logic Programming (ILP), had to face important
plateau phenomena [5,6,7]: the evaluation function, used to prioritize nodes in
the refinement graph, is constant in parts of the search space, and the search goes
blind. These plateau phenomena are the pathological case of heuristic search.




Fig. 1. (from [8]) Coverage probability $P_{sol}$ of an example with L constants by a
hypothesis with m literals (built from 10 variables). The contour level plots, projected
onto the plane (m, L), correspond to the region where $P_{sol}$ is between 0.99 and 0.01



An explanation can be given following the seminal work of [8], who studied the
ILP covering test within the phase transition framework. The covering test is
NP-complete and therefore exhibits a sharp phase transition in its coverage
probability [9], as illustrated in figure 1. If one studies the probability of covering
an example of a fixed size by a hypothesis given the hypothesis’ size, one
distinguishes three well-identified regions: a region, named “yes”, where the probability
of covering an example is close to 1, a region, named “no”, where the probability
is close to 0, and finally the phase transition where an example may or may
not be covered. As the heuristic value of a hypothesis depends on the number
of examples covered (positive or negative), we see that the two regions “yes”
and “no” represent plateaus that need to be crossed during search without an
informative heuristic value.
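The shape of this curve can be reproduced roughly with a small Monte Carlo experiment. The sketch below is only loosely inspired by the random model of [8]: the function names, the parameter values and the choice of one fresh binary predicate per literal are assumptions made for illustration, and only 4 variables are used (instead of 10) so that a brute-force covering test stays cheap.

```python
import random
from itertools import product


def random_problem(m, n_vars, n_consts, n_tuples, rng):
    """Draw one random (hypothesis, example) pair.

    The hypothesis has m literals; literal i uses its own binary predicate
    applied to two distinct variables. The example gives predicate i a
    relation made of n_tuples distinct pairs of constants.
    """
    all_pairs = list(product(range(n_consts), repeat=2))
    literals = [tuple(rng.sample(range(n_vars), 2)) for _ in range(m)]
    relations = [set(rng.sample(all_pairs, n_tuples)) for _ in range(m)]
    return literals, relations


def is_covered(literals, relations, n_vars, n_consts):
    """Brute-force covering test: does some substitution satisfy every literal?"""
    for theta in product(range(n_consts), repeat=n_vars):
        if all((theta[x], theta[y]) in rel
               for (x, y), rel in zip(literals, relations)):
            return True
    return False


def coverage_probability(m, n_vars=4, n_consts=6, n_tuples=10, trials=30, seed=0):
    """Estimate the probability that a random m-literal hypothesis covers a random example."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        literals, relations = random_problem(m, n_vars, n_consts, n_tuples, rng)
        hits += is_covered(literals, relations, n_vars, n_consts)
    return hits / trials


if __name__ == "__main__":
    # Short hypotheses sit in the "yes" region, long ones in the "no" region.
    for m in (1, 4, 8, 12, 16, 30):
        print(m, coverage_probability(m))
```

With these settings, very short hypotheses are always covered and very long ones essentially never, so a coverage-based evaluation function cannot rank hypotheses within either region.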

The state of the art on evaluation functions used in learning (see for example
[10]) shows that all of them are based on two main parameters, namely the
rates of positive and negative examples covered. As these two parameters are
inherited from the definition of the learning task, it is unlikely that the
plateau problem can be solved by designing new evaluation functions.
This problem has been well studied in the problem solving community [11] and
the solution proposed is based on macro-operators. Macro-operators (macros, for
short) are refinement operators defined by composition of elementary refinement
operators. They are able to apply several elementary operators at a time (adding
or removing several literals in a hypothesis in the case of relational learning) and
therefore are likely to cross non-informative plateaus. The application of macros
in relational learning is not new and has been investigated in e.g. [5,6,7,12]. In
these systems, macros are added to the list of initial refinement operators and
are then classically exploited to search in the refinement graph. In general, this
type of technique is known in Machine Learning as Constructive Induction [13],
or Predicate Invention in ILP. However, in practice the induced growth of the
hypothesis space leads to a decrease in performance [14].
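The idea of a macro as a composition of elementary refinements can be sketched on a toy train-like dataset. Everything in the block is an illustrative assumption rather than the paper's implementation: the data, the slot-based encoding of literals (a simplification that avoids explicit variables) and the score, taken here as positives covered minus negatives covered.

```python
# Toy data (hypothetical): properties of the car found in each slot of a train.
TRAINS = {
    "t1": {"car_1": {"short"}, "car_2": {"roof"}},
    "t2": {"car_1": {"short"}, "car_2": {"short"}},
    "t3": {"car_1": {"roof"}, "car_2": {"roof"}},
}
POSITIVE = {"t1", "t2"}  # t3 is the only negative example
SLOTS = ("car_1", "car_2")
PROPS = ("short", "roof")


def covers(body, train):
    """A body is a frozenset of literals: ("car", slot) or ("prop", slot, p)."""
    for lit in body:
        if lit[0] == "car":
            if lit[1] not in train:
                return False
        else:
            _, slot, prop = lit
            if prop not in train.get(slot, set()):
                return False
    return True


def score(body):
    """Coverage-based evaluation: positives covered minus negatives covered."""
    pos = sum(covers(body, t) for name, t in TRAINS.items() if name in POSITIVE)
    neg = sum(covers(body, t) for name, t in TRAINS.items() if name not in POSITIVE)
    return pos - neg


def elementary_refinements(body):
    """Add exactly one literal; a property literal needs its car literal first."""
    for slot in SLOTS:
        car = ("car", slot)
        if car not in body:
            yield body | {car}
        else:
            for prop in PROPS:
                lit = ("prop", slot, prop)
                if lit not in body:
                    yield body | {lit}


def macro_refinements(body):
    """A macro: the composition of two elementary refinement steps."""
    for step1 in elementary_refinements(body):
        for step2 in elementary_refinements(step1):
            yield step2


start = frozenset()
plateau = [score(h) for h in elementary_refinements(start)]
best = max(macro_refinements(start), key=score)
```

On this data every elementary refinement of the empty body has the same score as the empty body itself, whereas the best macro adds the car literal and its `short` property in one step and separates the positive trains from the negative one.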

As opposed to these approaches, we view macros as a means to simplify 
the search space by considering only macros as elementary operators. In other 
words, our focus is only on the sub-graph generated by the macros. We propose 
to study a macro-operator approach with respect to the complexity of the new 
representation language of the hypothesis space associated with this sub-graph. 
Our approach is similar to abstraction by means of macro-operators [15,11]. 

We are going to show that a large set of ILP approaches, named propositionalization
after [16], implicitly use macros to select a sub-graph whose representation
language is equivalent to attribute-value (AV). In doing so, they can
delegate the simplified learning problems to well-known AV algorithms
[17,18,16,19,20,21,22,23]. The advantage of formalizing these approaches in terms of
macro-operators is threefold. Firstly, it allows us to clarify and motivate the
propositionalization approaches, which are still considered ad hoc techniques in
the literature, as techniques to improve heuristic search. Secondly, by drawing
a parallel with macro-operators, it allows ILP to benefit from techniques
formalized and developed in this framework by the Problem Solving community
(see e.g. [24,11]). Finally and most importantly, this formalization, by showing
propositionalization as a two-stage process (creation of macros and delegation of
the simplified search space to simpler learning algorithms), points to promising
extensions. Propositionalization as proposed in the literature appears to be a special
case of resolution by macro-operators. Although targeting AV languages is interesting
because of the subsequent use of efficient learning algorithms, the cost of building
macro-operators for this language is prohibitive and equivalent to the cost of
“classical” ILP systems¹ even on very simple cases, as we will show. We propose to
relax the constraint on the hypothesis space induced by the definition of macros,
by allowing more expressive languages, to offer a better trade-off between the
cost of building macros and the cost of learning.

In the next section, before describing the use of macro-operators to solve ILP
problems, we recall the work of various researchers on delegating
the resolution of ILP problems to simpler learning algorithms, namely AV
and multi-instance ones, without loss of information [25,26,3,27,28,29,30]. We
will use this section to introduce the notations used in the rest of the paper. In
section 3, we will illustrate the macro-operator approach and we will show how
we can associate a language to the hypothesis sub-space selected by macros. A
set of languages that can be chosen by means of macros will be presented. We will
then show in section 5 that the language used by current propositionalization
approaches, which is a determinate language, cannot efficiently delegate learning
to AV algorithms. In section 6, we will discuss the k-locale language [3] that
appears to be the first promising relaxation of the hypothesis space language. We
will show that this language allows us to reformulate the initial ILP problem into
a new representation language we name multi-table. This language is simpler
than the multi-instance language [31]. Moreover, in a more constrained form of
the language, we will see that a monomial is PAC-learnable, which suggests that
efficient algorithms exist. In section 7, we further generalize the hypothesis
sub-space language to the k-free and k-indeterminate languages and briefly discuss
the properties of propositionalization in these languages. Finally, we will
conclude and argue that the important effort put into propositionalization techniques
can be pursued in richer languages that lift their current shortcomings.

¹ We refer here to “classical” ILP systems as learning systems not built on
macro-operators.

2 Propositionalization 

[25,26,32] showed that the ij-determinate language (determinate for short),
restricting both the hypothesis space, $\mathcal{L}_h$, and the instance space, was the
strongest language bias that could be applied to ILP. Indeed, they proved that it was as
expressive as an AV language and that it could be compiled in polynomial time
into it. This language has been used as the learning language in LINUS and
DINUS [33]. Their strategy is to change the representation of the determinate ILP
problem to AV, then to delegate learning to AV algorithms and finally to write
the learnt theory back into the determinate language. The work of Lavrač and
Džeroski has been fruitful for ILP because it suggested that AV expertise
could be reused through a change of representation of the initial learning problem.

Extensions of this approach have been proposed in more expressive languages 
that introduce indeterminism [3,27,28,29,30]. In this case, the reformulated ILP 
problem is not an AV problem but a multi-instance one. We now give a general 
description of this change of representation. It is often termed propositional- 
ization in the literature, in a more specific way than [16], as it refers only to 
the change of representation, independently of the use of macro-operators. We 
will use this more specific meaning in the rest of the paper in order to better 
distinguish the different techniques presented in the next section. 

A prerequisite to delegate the resolution of an ILP problem to AV or multi- 
instance algorithms is that the hypothesis spaces in both languages must be 
isomorphic. In the following, we represent such a bias as a hypothesis base (by 
analogy with a vectorial base), noted B. 

Definition 1 (hypothesis base). A hypothesis base, noted B, is of the form:

$B = \langle\, tc \leftarrow l_1, \ldots, l_n,\ v_1, \ldots, v_m \,\rangle$

with $l_1, \ldots, l_n$ a set of literals and $v_1, \ldots, v_m$ a set of constraint variables in
$l_1, \ldots, l_n$. Each hypothesis has the head tc and its body is represented by a vector
in B. Each coordinate indicates if the literal is present or not, or represents the
domain of a constraint variable.

Table 1. A train problem and its propositionalization given B



Examples          Background Knowledge
e+  west(t1)      car(t1,c11). roof(c11).  #loads(c11,2).
e-  west(t2)      car(t1,c12). short(c12). #loads(c12,2).
                  car(t2,c21). short(c21). #loads(c21,4).
                  car(t2,c22). rect(c22).  #loads(c22,2).

Variables  | Attributes
C1    C2   | car(T,C1)  roof(C1)  #loads(C1,2)  car(T,C2)  #loads(C2,N)  N
-----------+----------------------------------------------------------------
c11   c11  | T          T         T             T          T             2
c11   c12  | T          T         T             T          T             2
c12   c11  | T          F         T             T          T             2
c12   c12  | T          F         T             T          T             2
-----------+----------------------------------------------------------------
c21   c21  | T          F         F             T          T             4
c21   c22  | T          F         F             T          T             2
c22   c21  | T          F         T             T          T             4
c22   c22  | T          F         T             T          T             2



By construction, there is a one-to-one mapping between a hypothesis in the
clausal space and a hypothesis in the AV space. To perform the search in the AV
search space, the ILP instance space needs to be reformulated to emulate the
covering test. In other words, the change of representation hard-codes the
covering test between the hypotheses and the examples by computing all matchings
between B and the ILP examples for subsequent use by the AV or multi-instance
algorithms. If the language is determinate, there will be only one matching and
therefore an example will be reformulated into one AV vector; in the case
where the language is indeterminate, an example will be reformulated into a bag
of vectors. The fact that the new instance space in the latter case is a
multi-instance space is due to the definition of the covering test in ILP: to cover an
ILP positive example, we need to find one matching substitution, and to reject
an ILP negative example, we must find no matching substitution.

An example of propositionalization of a toy problem à la Michalski’s trains
under θ-subsumption is given in table 1 with the base:

$B = \langle\, west(T) \leftarrow car(T,C1),\ roof(C1),\ \#loads(C1,2),\ car(T,C2),\ \#loads(C2,N),\ N \,\rangle$

For lack of space, we assume the reader is familiar with ILP (see e.g. [34]). The
semantics is classical; for example, the first positive example reads: the train
goes west, its first car has a roof and carries two loads, and its second car is
short and has two loads as well. As the train representation is indeterminate,
the new representation obtained by propositionalization is a multi-instance one.
For readability, we omit the instantiation of the head variables of B, as
they are determinate.
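
As a minimal sketch of this reformulation (in Python, with a hypothetical encoding of one train and helper names that do not come from the paper), the following enumerates every matching of the car variables of B against the cars of a train and emits one attribute-value vector per matching. The covering test then only asks for one satisfying vector in the bag.

```python
from itertools import product

# Hypothetical encoding of one train: car -> properties (assumed data).
train_t1 = {
    "c11": {"roof": True, "loads": 2},
    "c12": {"roof": False, "loads": 2},
}

def propositionalize(train):
    """Return the bag of vectors for the base with body
    car(T,C1), roof(C1), #loads(C1,2), car(T,C2), #loads(C2,N), N.
    One vector is produced per matching of (C1, C2) onto the cars of the train;
    under theta-subsumption C1 and C2 may map to the same car."""
    bag = []
    for c1, c2 in product(train, repeat=2):
        bag.append((
            True,                      # car(T,C1): C1 is a car of T
            train[c1]["roof"],         # roof(C1)
            train[c1]["loads"] == 2,   # #loads(C1,2)
            True,                      # car(T,C2): C2 is a car of T
            True,                      # #loads(C2,N) holds for some N
            train[c2]["loads"],        # value bound to N
        ))
    return bag

def covers(hypothesis, bag):
    """Multi-instance covering test: one vector must satisfy the hypothesis."""
    return any(hypothesis(v) for v in bag)

bag_t1 = propositionalize(train_t1)
# Hypothetical hypothesis: roof(C1), #loads(C1,2), N >= 2
h = lambda v: v[1] and v[2] and v[5] >= 2
print(len(bag_t1), covers(h, bag_t1))
```

With two cars and two car variables the bag holds four vectors, which illustrates how the bag size grows with the number of indeterminate variables.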

As noted in [35], propositionalization is not tractable in the general case:
the covering test being NP-complete, the reformulation of a single relational
example e can yield an exponential number of AV vectors with respect to the size
of B and e. However, propositionalization has been used successfully in
numerical learning. As we can see in the example above, B introduces the
constraint variable N, and constraints on it can be learnt by a multi-instance
algorithm. This propositionalization dedicated to numerical learning, which can
be traced back to INDUCE [36], has been developed in [27,37,38,39]. In the rest
of the paper, we show that this is one of the advantages that motivate the
definition of macro-operators in languages richer than determinate languages.



3 Solving ILP Problems by Means of Macro-Operators 



As mentioned previously, macro-operators have already been used in ILP [5,6, 
7,12]. We illustrate their approach in an example showing a plateau at the first 
refinement step. 



Example 1. We run a top-down greedy algorithm à la FOIL [5] on the
following train-like problem:



Examples
e+ west(t1)
e+ west(t2)
e- west(t3)



Background Knowledge
car_1(t1,c11). short(c11). car_2(t1,c12). roof(c12).
car_1(t2,c21). short(c21). car_2(t2,c22). short(c22).
car_1(t3,c31). roof(c31). car_2(t3,c32). roof(c32).



It has two positive examples and one negative example. The following hypothesis
is consistent with the data:



west(T) ← car_1(T,C), short(C)

The search starts with a hypothesis with an empty body and considers the possible
refinements to specialize it. The two literals car_1(T,C) and car_2(T,C) alone do
not discriminate between the positive and negative examples (each train having
two cars) and therefore have the same heuristic value (zero information gain).
The learning algorithm has to make an arbitrary choice and may, for example, add
car_2(T,C), which does not lead to a solution, whereas the conjunction
car_1(T,C), short(C) discriminates between the classes. The way to overcome this
lack of heuristic information is to define a macro-operator that specializes the
hypothesis with this conjunction in one step, crossing the plateau.
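
The plateau can be checked numerically. The sketch below is a simplified, example-level version of an information-gain heuristic (FOIL itself counts variable bindings rather than examples); the data layout and function names are assumptions made for illustration.

```python
from math import log2

# Background knowledge of Example 1: train -> {car position: set of properties}.
trains = {
    "t1": {1: {"short"}, 2: {"roof"}},
    "t2": {1: {"short"}, 2: {"short"}},
    "t3": {1: {"roof"}, 2: {"roof"}},
}
positives, negatives = {"t1", "t2"}, {"t3"}

def covered(position, prop=None):
    """Trains having a car at `position` (with property `prop` if given)."""
    return {t for t, cars in trains.items()
            if position in cars and (prop is None or prop in cars[position])}

def gain(cov):
    """Example-level information gain of a refinement of the empty-body clause."""
    p0, n0 = len(positives), len(negatives)
    p1, n1 = len(cov & positives), len(cov & negatives)
    if p1 == 0:
        return 0.0
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

print(gain(covered(1)), gain(covered(2)))   # single literals car_1, car_2
print(gain(covered(1, "short")))            # macro car_1(T,C), short(C)
```

Both single-literal refinements keep the class ratio unchanged and score zero, while the two-literal conjunction scores strictly higher, which is the information a macro-operator exposes to the greedy search.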

One solution we adopt to define the macro-operators, which is also discussed in
[6], is to create a new literal whose definition corresponds to the composition of
the elementary refinement operators. By adding this literal to the background
knowledge by saturation (see e.g. [40]), the initial refinement operators can use
this literal to refine a hypothesis. Classically, we define a macro-operator with
respect to the hypothesis language, its definition corresponding to a valid
sequence of refinement operators.

Definition 2 (macro-operator). A macro-operator is a predicate m. The
definition of m is a set of clauses {D_i} having the head built on m. Each clause
is built on the hypothesis language L_h, i.e.:

∀i, ∃h ∈ L_h s.t. body(D_i) = body(h)

Example 2. We use the Prolog notation of predicate definition to define the
two following macros:

car_1_short(T) :- car_1(T,C), short(C).
car_2_roof(T) :- car_2(T,C), roof(C).



The new literals are added to the background knowledge if their definition 
holds in this background knowledge (saturation) , and we have the new learning 
problem: 



Examples
e+ west(t1)
e+ west(t2)
e- west(t3)


Background Knowledge
car_1(t1,c11). short(c11). car_2(t1,c12). roof(c12).
car_1(t2,c21). short(c21). car_2(t2,c22). short(c22).
car_1(t3,c31). roof(c31). car_2(t3,c32). roof(c32).
car_1_short(t1). car_2_roof(t1). car_1_short(t2). car_2_roof(t3).



Here, we propose to simplify the representation by using only the newly
defined literals. In other words, we are not interested in the refinement graph
augmented with the macros, but only in the sub-graph generated by the macros.
If we consider again the above example with this approach, we have the new
representation:



Examples
e+ west(t1)
e+ west(t2)
e- west(t3)


Background knowledge
car_1_short(t1). car_2_roof(t1).
car_1_short(t2). car_2_roof(t3).



We can notice now that only the constants of the examples (t1, t2, t3) are in
the background knowledge. This language is remarkable because it is a
determinate language and is known to be as expressive as the attribute-value
language, as explained in the previous section. The search can then be delegated
to an AV algorithm by propositionalizing the learning examples. This
reformulation is shown in Table 2.



Table 2. Reformulation of a determinate ILP problem into an AV problem

      Variables   Attributes
      T           car_1_short(T)   car_2_roof(T)
e+    t1          T                T
e+    t2          T                F
e-    t3          F                T



This example is representative of the way we propose to solve ILP problems by
macro-operators; we give a general algorithm later (section 4).
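
The saturation and reformulation steps of this example can be sketched as follows. This is only an illustration in Python: the tuple encoding of facts, the `macros` dictionary and the `holds` helper are assumptions, and each macro is restricted to the shape m(T) :- car_pred(T,C), prop(C) used above.

```python
# Background knowledge of Example 1 as ground facts (predicate, args...).
facts = {
    ("car_1", "t1", "c11"), ("short", "c11"), ("car_2", "t1", "c12"), ("roof", "c12"),
    ("car_1", "t2", "c21"), ("short", "c21"), ("car_2", "t2", "c22"), ("short", "c22"),
    ("car_1", "t3", "c31"), ("roof", "c31"), ("car_2", "t3", "c32"), ("roof", "c32"),
}

# Each macro m(T) :- car_pred(T, C), prop(C) is stored as (car_pred, prop).
macros = {"car_1_short": ("car_1", "short"), "car_2_roof": ("car_2", "roof")}

def holds(macro, train):
    """Saturation test: does the macro definition hold for this train?"""
    car_pred, prop = macros[macro]
    return any(f[0] == car_pred and f[1] == train and (prop, f[2]) in facts
               for f in facts if len(f) == 3)

examples = [("t1", "+"), ("t2", "+"), ("t3", "-")]

# Determinate language: exactly one attribute-value vector per example.
table = {t: tuple(holds(m, t) for m in macros) for t, _ in examples}
for train, label in examples:
    print(label, train, table[train])
```

Because every macro has the example identifier as its only argument, each example yields a single boolean vector, so any attribute-value learner can be applied to `table`.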
As far as we know, the first works on this approach, even if they do not refer
to macros, are due to [17,18]. A lot of subsequent work has been done [16,19,
20,21,22,23] with the same principle of defining macros to simplify the problem
into a determinate language, then propositionalizing to delegate learning to AV
algorithms. We do not discuss the different techniques that they propose to build
what we refer to as macros, as this is outside the scope of the paper.

Viewing these approaches as a two-stage process points out interesting
directions. On the one hand, it unifies all the above-mentioned approaches by
focusing on the way they build the macro-operators. Moreover, it provides a
motivation for them as a way to improve heuristic search in ILP, whereas they
are still viewed in the literature as ad hoc techniques. On the other hand,
these approaches appear to be a particular case of solving ILP problems by
macros. They define macros to target the particular determinate language, but we
can develop new approaches that define macros to target more expressive
languages. By doing so, we argue that we obtain different trade-offs between the
cost of building macros and the cost of learning. For instance, we show in the
next section that building macros for a determinate language is as expensive as
“classical” ILP on simple cases.

Even though any language, from the determinate languages used so far to
restrictions of first-order logic, can be targeted, we restrict ourselves to
relational languages for which propositionalization has been proposed. This
restriction is motivated by the fact that learning algorithms for these
languages are well known and efficient, so we can hope for better trade-offs. We
consider all the above-mentioned approaches, which have been shown to be
competitive with “classical” ILP systems, as an empirical validation.

4 The General Macro-Operator Approach 

We now give the general algorithm for solving ILP problems by means of macro- 
operators: 

1. Choose L_B, the language in which the hypothesis base B will be expressed.

2. Build a set of macro-operators with respect to the hypothesis language to
define B in L_B.

3. Propositionalize the learning data.

4. Apply a learning algorithm tailored to the representation of the new learning
problem.

5. Write the learnt theory back into the initial ILP language.²

Note that we merge the saturation and propositionalization stages (step 3), as
is done in the systems proposed in the literature; the cost of saturation, which
is NP-complete in the general case, is added to the cost of propositionalization.
The size of the reformulated problem can be straightforwardly deduced from the
hypothesis base: it is exponential in the number of variables in the base,
excluding the variables appearing in the head.

² This is not always possible, depending on the learning algorithm (neural
networks, for example). Classification of future instances must then be done in
the propositionalized space.

The gain that languages more expressive than determinate languages can
provide to the macro-operator approach is offset by the cost of reformulating
an ILP problem by propositionalization. Indeed, propositionalization
of indeterminate languages yields multi-instance problems of exponential size, as
the covering test is NP-complete in these languages (section 2). The cost of
propositionalization must therefore be controlled by using languages with
bounded indeterminism. Some of these languages have been well studied by [3] and
we propose to use them as a basis for new languages for the macro-operator
approach. The resulting expected trade-offs between the cost of building macros
and the cost of learning are presented in Figure 2. The cost of building
macro-operators has to be understood as the complexity of defining macros in
order to obtain a consistent representation of the new problem.



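[Figure 2: diagram with the complexity of building macro-operators on the vertical axis and the complexity of learning on the horizontal axis. Plotted languages, from high building cost to high learning cost: “classical” ILP; determinate ([17,18], [20,21], [16,19], [23,41]); k-locale; k-free and l-indeterminate; indeterminate. References [26,33] and [29,30] are placed toward the indeterminate end. The learning axis is labelled AV, Multi-table and Multi-instance.]

Fig. 2. Trade-offs between cost of building macros and cost of learning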



On one side of the range of possible trade-offs, we have “classical” ILP
systems. Indeed, we can see a clausal theory as a set of macro-operators³ and a
propositionalization can be performed. Learning is straightforward, classifying a
new example as positive if one of its attributes is true, and negative otherwise,
that is to say if at least one of the clauses covers the example.⁴ This example
illustrates the cost of building the macros necessary to obtain a trivial
representation of the examples.

³ This approach has been used early in ILP and has been refined recently in the
“repeat learning framework” [14].

On the opposite side, we have the approaches of [29,30] that do not build any
macros or, more precisely in our framework, directly define a macro as a literal.
The cost of learning is then entirely delegated to a multi-instance algorithm,
which has to deal with data of exponential size in the general case.

In between these two extremes, we see the different classes of approaches,
which depend on the expressivity of the representation language of the hypothesis
base. In the next section, we discuss the limitations of the first class, which
targets the determinate language to delegate some cost to AV algorithms
[17,18,16,19,20,21,22,23]. We then evaluate the interest of the languages we
propose, focusing on the k-locale language, which is the first promising
relaxation of the determinate language.

5 Limitations of the Determinate Language 

Intuitively, the cost of building macros in order to get a consistent representation
of the learning problem in a determinate language is very high. It comes from the
fact that the new representation describes the example as a whole, as a unique
entity, what [20] named an individual-centered representation. That is to say,
the macros must define global properties of the learning example. Consider a
macro for the determinate language:

short_car_w/roof(T) :- car(T,C1), short(C1), roof(C1)

It describes a property of the train, of the example as a whole: the
train has the property of having a short car with a roof. This impacts the cost
in two ways.

5.1 Complexity of Numerical Learning 

The numerical learning capacity is very limited. We can have a macro describing
the total number of loads that the train carries, but not, for example, the
number of loads carried by a car of the train. In other words, the numerical
learning part delegated to the learning algorithm is limited to the properties
of the example as a whole. For example, as a workaround, the RELAGGS system
[21] computes statistics like the average and the maximum of numerical values
appearing in the example. To do numerical learning on parts of the example, the
ILP system must generate macro-operators such as:

car_2+_loads(T) :- car(T,C1), #loads(C1,N), N > 3

to learn that a car must carry more than 2 loads.

⁴ This is very similar to the table of macro-operators defined by [11], who notes
that once the table is computed, no search has to be done to solve the problem.

A solution to this problem is to delegate the complexity of numerical learning
to multi-instance algorithms as in [27,28] for example. Numerical variables are
then introduced in the hypothesis base (introducing indeterminism) and the
set of all their matchings is gathered in a tabular format. As an example,
learning constraints on the number of car loads is made possible by the
definition of the macro:

#car_loads(T,N) :- car(T,C1), #loads(C1,N)

The associated hypothesis base B = (west(T) ← ⟨#car_loads(T,N), N⟩)
would give, on a hypothetical ILP problem, a reformulation in which each train
is described by the bag of values matched by N.

A multi-instance algorithm could then induce the concept definition:

west(T) ← #car_loads(T,N), N > 3
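
To make the delegation concrete, here is a small sketch of what a multi-instance learner has to do with such bags. The bags, labels and the brute-force threshold search are hypothetical and chosen only for illustration; they are not the data or the algorithm of [27,28].

```python
# Hypothetical bags: for each train, the values of N matched by #car_loads(T, N).
bags = {"t1": [1, 4], "t2": [5], "t3": [2, 3], "t4": [1, 1]}
labels = {"t1": True, "t2": True, "t3": False, "t4": False}

def covers(threshold, bag):
    """west(T) :- #car_loads(T, N), N > threshold.
    The clause covers a train if at least one matching of N satisfies it."""
    return any(n > threshold for n in bag)

def consistent_thresholds(bags, labels):
    """Try every observed value as a threshold and keep the consistent ones."""
    candidates = sorted({n for bag in bags.values() for n in bag})
    return [th for th in candidates
            if all(covers(th, bags[t]) == labels[t] for t in bags)]

print(consistent_thresholds(bags, labels))
```

The numerical constraint is found by the learner itself, so the ILP side only has to supply the macro that exposes N, not one macro per candidate threshold.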



5.2 Size of Macro-Operators 

A second limitation is that some concepts cannot be learnt by AV algorithms
and therefore must be represented by a single macro. Building the macros is then
as expensive as running a classical ILP algorithm. We show this case in a simple
example. Let us assume that we have to learn the following concept:



west(T) ← car(T,C1), rect(C1), car(T,C2), rect(C2), C1 ≠ C2

The concept is that a train has two different rectangular cars. We can see that
it cannot be learned by building macros defined with a subset of the concept’s
literals. The only way to prevent the two cars from being instantiated to the
same car is to form a unique macro equivalent to the definition of the concept:

2_rect_cars(T) :- car(T,C1), rect(C1), car(T,C2), rect(C2), C1 ≠ C2

By relaxing the constraint of determinism, one can build two less complex
macros, each defining a rectangular car, and obtain a multi-instance problem.
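
The following sketch illustrates why a determinate macro built on a subset of the literals is not enough. The two trains and the function names are hypothetical; `rect_car` stands for a macro of the form rect_car(T) :- car(T,C), rect(C).

```python
from itertools import permutations

# Hypothetical trains: car -> set of shape properties.
one_rect = {"c1": {"rect"}, "c2": {"short"}}
two_rect = {"c1": {"rect"}, "c2": {"rect"}}

def rect_car(train):
    """Determinate macro built on a subset of the concept's literals."""
    return any("rect" in props for props in train.values())

def two_rect_cars(train):
    """Single macro equivalent to the whole concept, with C1 != C2 enforced."""
    return any("rect" in train[c1] and "rect" in train[c2]
               for c1, c2 in permutations(train, 2))

# rect_car cannot tell the two trains apart, whereas two_rect_cars can.
print(rect_car(one_rect), rect_car(two_rect))
print(two_rect_cars(one_rect), two_rect_cars(two_rect))
```

Using `rect_car` twice in a determinate representation adds nothing, because both occurrences can be satisfied by the same car.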

We see that in both cases the use of a richer representation language allows
better use of the learning algorithms. The first promising language that we
consider is the k-locale language.

6 The k-locale Language

The k-locale language has been proposed by [3] to introduce a bounded
indeterminism in each locale of a clause. A locale is defined as a maximal set of
literals whose free variables do not appear in other locales. By bounding
the maximum number of literals in a locale by k, the covering test complexity
of a k-locale clause is exponential in k, the covering test of each locale being
independent. In this language, the hypothesis base is a k-locale of the form
B = (tc ← ⟨LOC_1, ..., LOC_n⟩), where each LOC_i, a conjunction of macros,
is a locale. However, as opposed to the definition of [3], the number of matching
substitutions of a locale depends here on its number of variables (the saturation
process being implicit). We therefore redefine the notion of locality as follows:

Definition 3 (k-locale base). Let the k-locale hypothesis base be:

B = (tc ← ⟨LOC_1, ..., LOC_n⟩)

We have ∀i ∈ {1, ..., n}, LOC_i is a locale and k ≥ |vars(LOC_i) \ vars(tc)|
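
Definition 3 can be checked mechanically. The sketch below groups the literals of a base into locales and computes the smallest k for which the base is a k-locale; the tuple encoding of literals is an assumption, and every argument is treated as a variable (constants are not handled).

```python
def locales(head_vars, body):
    """Group body literals into locales: maximal sets of literals linked by
    shared free variables (variables that do not occur in the head)."""
    groups = []  # list of (set of free variables, list of literals)
    for literal in body:
        free = set(literal[1:]) - set(head_vars)
        merged_vars, merged_lits = set(free), [literal]
        rest = []
        for vars_, lits in groups:
            if vars_ & free:            # shares a free variable: same locale
                merged_vars |= vars_
                merged_lits = lits + merged_lits
            else:
                rest.append((vars_, lits))
        groups = rest + [(merged_vars, merged_lits)]
    return groups

# Hypothetical base with head west(T): two car descriptions, one with a load count.
base_body = [
    ("short_car", "T", "C1"), ("roof", "C1"),
    ("short_car", "T", "C2"), ("loads", "C2", "N"),
]
groups = locales(["T"], base_body)
k = max(len(free) for free, _ in groups)
print(len(groups), k)
```

Here the literals on C1 and those on C2 and N never share a free variable, so they form two locales and the base is a 2-locale in the sense of Definition 3.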

This language makes it possible to define macros that introduce a bounded
indeterminism at the level of each locale of B. For example, the following B is
a 2-locale, with its two locales written on separate lines:



B = (west(T) ← ⟨short_car(T,C1), roof(C1),
                 short_car(T,C2), #loads(C2,N), N⟩)

This definition of locality is similar to [42], who used the principle of locality
as a decomposition technique to improve the covering test. This decomposition
technique is a classical way to decompose a problem into a set of sub-problems
(see e.g. [43]), and we will see the additional benefit of k-locale bases on
propositionalization.

By definition of locality, while propositionalizing an ILP problem with
B, the instantiation of variable C1 does not constrain the instantiations of
variable C2. Therefore, it is more efficient to represent the new instance space
by a product representation.

Let us consider the following train problem example: 



Examples
e+ west(t1)
e- west(t2)


Background knowledge
car(t1,c11). roof(c11). rect(c11).
car(t1,c12). short(c12). #loads(c12,2).
car(t2,c21). short(c21). roof(c21).
car(t2,c22). rect(c22). #loads(c22,3).



The propositionalization must take into account the locales to generate for each
of them a multi-instance problem, as we show in Figure 3. This representation of
the problem has at most two lines per example, instead of four in its developed
representation, which would be obtained by naive propositionalization. This
representation, obtained by decomposition of a multi-instance problem, is
simpler and exponentially more compact than the developed multi-instance
problem. We name this new representation the multi-table representation, each
table being a multi-instance problem. Note that we do not see the multi-table
representation as a generalization of the multi-instance representation, but as
a representation in between AV and multi-instance, as a result of a product
representation of a multi-instance problem, like other decomposition techniques
(see [42,43]). We present a learnability result for this language below.

Fig. 3. A multi-table representation and its developed multi-instance one



6.1 On Learnability of Multi-table Problems 

A multi-table representation is a vector ⟨T_1, ..., T_n⟩ where each coordinate is
a multi-instance problem. An example is of the form e = ⟨b_1, ..., b_n⟩ with b_i a
bag of AV instances defined in the instance space of T_i. The number of instances
in each bag is variable and depends on the table.
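
A minimal sketch of this data structure, and of its relation to the developed multi-instance form, is given below. The example rows are hypothetical and the function names are illustrative; the only point is that the developed bag is the Cartesian product of the per-table bags and that the covering test can be run table by table.

```python
from itertools import product

# Hypothetical multi-table description of one example: one bag per table.
example = [
    [("short", "roof"), ("rect", "no_roof")],   # bag b_1 in table T_1
    [(2,), (3,)],                               # bag b_2 in table T_2
]

def develop(tables):
    """Expand a multi-table example into the equivalent multi-instance bag
    by taking the Cartesian product of the per-table bags."""
    return [sum(rows, ()) for rows in product(*tables)]

def covers(concept, tables):
    """Multi-table covering test: each component c_i of the concept must be
    satisfied by at least one row of its own bag b_i."""
    return all(any(c(row) for row in bag) for c, bag in zip(concept, tables))

developed = develop(example)
concept = [lambda r: r[0] == "short", lambda r: r[0] > 2]
print(max(len(bag) for bag in example), len(developed))
print(covers(concept, example))
```

With n tables of m rows each, the multi-table form stores n * m rows whereas the developed bag has m ** n rows, which is the compaction discussed above.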

No learning algorithms are devoted to this representation yet, but it is easy
to extrapolate from the expertise gained in deriving multi-instance algorithms
from AV algorithms. As with the multi-instance representation, it is easy to
adapt top-down generate-and-test algorithms, as noted in [44,45]: these
algorithms search the same hypothesis space but adapt the covering test to the
multi-table representation.

Much in line with Cohen’s work on PAC-learnability of a single clause in the
k-locale language, an interesting restriction of the multi-table representation is
to bound the number of attributes by a constant l, for any number of tables. In
the macro-operator approach, this restriction corresponds to bounding the size
of the locales in the hypothesis base, but not its size. In this language, we show
that monomials are PAC-learnable and hence that efficient algorithms exist. We
give the proof for completeness, as we work with the multi-table representation
and do not have the same parameters as Cohen (l and k). Following [3], this
concept language is noted L_{l-multi-table}.

Theorem 1. For any fixed l, the language family L_{l-multi-table} is
PAC-learnable.



Proof 1. We use the proof technique given by [46] (Lemma 1): we reduce learning
in L_{l-multi-table} to learning monomials in a boolean space, which is known
to be PAC-learnable.

Lemma 1 (from [46]). Let L_h1 and L_h2 be two hypothesis languages, L_e1 and
L_e2 their associated languages of examples and C ∈ L_h1 the target concept. Let
f_i : L_e1 → L_e2 and f_c : L_h1 → L_h2 be the two reduction functions.

If L_h1 reduces to L_h2 and L_h2 is PAC-learnable and f_c^(-1)(C) is computable
in polynomial time then L_h1 is PAC-learnable.

We consider that all tables have l attributes, possibly completed with neutral
attributes for learning. We build a boolean representation of the multi-table
problem by associating to each hypothesis of the search space a boolean attribute.
As the hypothesis space is a direct product of the hypothesis spaces of each table,
the boolean representation has n × 2^(2l) attributes. To obtain those attributes, for
each table T_i containing the l boolean attributes {t_i1, ..., t_il}, we define a set of
2l boolean attributes A_i = {t_i1, ..., t_il, ¬t_i1, ..., ¬t_il}. For each set A_i, we define
a set B_i of 2^(2l) attributes, each attribute corresponding to a hypothesis of T_i built
on A_i.

f_i : Each example is redescribed with the n × 2^(2l) attributes (B_1, ..., B_n) where
an attribute is true if the corresponding hypothesis covers the example and
false otherwise.

f_c : A concept is of the form ⟨c_1, ..., c_n⟩. By construction, each c_i matches
an attribute b ∈ B_i and c_i covers the example iff b is true. f_c preserves the
concept extension and is easily reversible in polynomial time.
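
The reduction function f_i of the proof can be sketched as follows for boolean attributes. The encoding of literals as (index, truth value) pairs and the small example are assumptions made for illustration; the construction enumerates all 2^(2l) conjunctions over the 2l literals of a table, including contradictory ones.

```python
from itertools import combinations

def monomials(l):
    """All 2^(2l) conjunctions over the 2l literals t_1..t_l, not t_1..not t_l.
    A literal is a pair (attribute index, required truth value)."""
    literals = [(j, v) for j in range(l) for v in (True, False)]
    return [frozenset(c) for r in range(len(literals) + 1)
            for c in combinations(literals, r)]

def covers(monomial, bag):
    """A monomial covers a bag if one instance satisfies all of its literals."""
    return any(all(inst[j] == v for j, v in monomial) for inst in bag)

def f_i(example, l):
    """Redescribe a multi-table example (b_1..b_n) with n * 2^(2l) booleans."""
    hyps = monomials(l)
    return [covers(h, bag) for bag in example for h in hyps]

# Hypothetical example with n = 2 tables and l = 1 boolean attribute per table.
example = [[(True,)], [(True,), (False,)]]
print(f_i(example, 1))
```

For fixed l the number of new attributes is linear in the number of tables n, which is why the reduction stays polynomial.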

7 Discussion 

Besides the k-locale language, all languages with bounded indeterminism (i.e.
yielding bounded multi-instance problems through propositionalization) are
candidates for the language of the hypothesis space targeted by the
macro-operators. The relative benefit of these different languages with respect
to the trade-off between the cost of building macros and the cost of learning is
still an open question, and much future work will consist in experimenting with
them. In Figure 2, we extrapolate on the trade-offs we can expect from two other
languages




22 



E. Alphonse 



adapted from [3]: the k-free and the l-indeterminate languages. These languages
are complementary in the way they bound the indeterminism. The first language,
bounding the number of free variables in B, theoretically bounds the size of the
multi-instance problem by an exponential in k. The second one directly bounds
the size of the multi-instance problem by l, which is the number of matchings
between B and the examples. As opposed to the former, it takes into account
the background knowledge and the type of the variables, and therefore allows
for a tighter upper bound.

As an example, one simple and efficient use of these languages is to introduce
indeterminism only with the numerical variables, to delegate numerical learning
to multi-instance algorithms.



8 Conclusion 

It is known that ILP is prone to plateau phenomena during heuristic search, and
macro-operators are well motivated to reduce these problems. We have proposed
a paradigm to solve ILP problems by means of macro-operators. As opposed to
previous approaches, the macros are not used to extend the refinement graph
but only to select a relevant sub-graph. We have shown that we can associate a
representation language to the sub-graph and study a macro-operator approach
in terms of the expressivity of this representation language. We have shown that
a large set of ILP techniques [17,18,16,19,20,21,22,23] are a particular case of the
paradigm. They define macros to reformulate ILP problems in a determinate
language and delegate learning by propositionalization to AV learning algorithms,
suitable for determinate languages. However, their cost of building macros is in
the worst case as expensive as “classical” ILP. Notably, numerical learning has to
be done in the ILP search space and cannot be delegated to AV algorithms. We
have then proposed a range of different languages adapted from [3] that allow
better trade-offs between the cost of building macros and the cost of learning.

The first promising trade-off is the k-locale language, which reformulates ILP
problems into a new representation language we have named multi-table. It is a
product representation of a multi-instance problem and can represent the same
amount of information in a more compact way. Moreover, we have shown that
monomials are PAC-learnable if the number of attributes per table is fixed,
which suggests that efficient learning algorithms are available. Such a restriction
appears naturally in the propositionalization framework if we limit the size of
each locale in the hypothesis space. We think that the definition of the
multi-table representation goes beyond its use in ILP and may be used as a
learning language as well.

From our point of view, the important effort put into designing macro-operators
for the determinate language over the last ten years can be pursued in richer
languages that lift their current shortcomings. ILP can benefit from the problem
solving community by reusing techniques to build macro-operators and then
delegating learning to simpler learning algorithms like AV, multi-table and
multi-instance algorithms. To start experimenting with the different classes of
approach, we will work on a multi-table algorithm and develop techniques to build
macro-operators for the k-locale language.



Abstract. The LogAn-H system is a bottom up ILP system for learning
multi-clause and multi-predicate function free Horn expressions in
the framework of learning from interpretations. The paper introduces a
new implementation of the same base algorithm which gives several orders
of magnitude speedup as well as extending the capabilities of the
system. New tools include several fast engines for subsumption tests,
handling real valued features, and pruning. We also discuss using data
from the standard ILP setting in our framework, which in some cases
allows for further speedup. The efficacy of the system is demonstrated
on several ILP datasets.



1 Introduction 

Inductive Logic Programming (ILP) has established a core set of methods and 
systems that proved useful in a variety of applications [20,4]. Early work in the 
Golem system [21] (see also [19]) used Plotkin’s [24] least general generalization 
(LGG) within a bottom up search to find a hypothesis consistent with the data. 
On the other hand, much of the research following this (e.g. [25,18,7,3]) has used 
top down search methods to find useful hypotheses. However, several exceptions 
exist. STILL [26] uses a disjunctive version space approach which means that it 
has clauses based on examples but it does not generalize them explicitly. The 
system of [1] uses bottom up search with some ad hoc heuristics to solve the 
challenge problems of [11]. The LogAn-H system [13] is based on an algorithm 
developed in the setting of learning with queries [12] but uses heuristics to avoid 
asking queries and instead uses a dataset as input. This system uses a bottom up 
search, based on inner products of examples which are closely related to LGG. 
Another important feature of LogAn-H is that it does a refinement search but, 
unlike other approaches, it takes large refinement steps instead of minimal ones. 

In previous work [13] LogAn-H was shown to be useful in a few small
domains. However, it was hard to use the system in larger applications mainly due
to high run time. One of the major factors in this is the cost of subsumption. Like
other bottom up approaches, LogAn-H may use very long clauses early on in
the search and the cost of subsumption tests for these is high. This is in contrast
to top down approaches that start with short clauses for which subsumption is
easy. A related difficulty observed in Golem [21] is that LGG can lead to very
large hypotheses. In LogAn-H this is avoided by using 1-1 object mappings.
This helps reduce the size of the hypothesis but gives an increase in complexity
in terms of the size of the search.

The current paper explores a few new heuristics and extensions to LogAn-H
that make it more widely applicable both in terms of speed and range of
applications. The paper describes a new implementation that includes several improved
subsumption tests. In particular, for LogAn-H we need a subsumption
procedure that finds all substitutions between a given clause and an example. This
suggests a memory based approach that collects all substitutions simultaneously
instead of using backtracking search. Our system includes such a procedure which
is based on viewing partial substitutions as tables and performing “joins” of such
tables to grow complete substitutions. A similar table-based method was
developed in [8]. This approach can be slow or even run out of memory if there are
too many partial substitutions in any intermediate step. Our system implements
heuristics to tackle this, including lookahead search and randomized heuristics.
The latter uses informed sampling from partial substitution tables if memory
requirements are too large. In addition, for some applications it is sufficient to
test for existence of substitutions between a given clause and an example (i.e.
we do not need all substitutions). In these applications we are able to use the
fast subsumption test engine Django [16] in our system. The paper shows that
different engines can give better performance in different applications, and gives
some paradigmatic cases where such differences occur.
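
The idea of growing complete substitutions by joining tables of partial substitutions can be sketched as below. This is a naive illustration under assumed data structures (atoms as tuples, substitutions as dictionaries), without the lookahead or sampling heuristics mentioned above, and it is not the implementation used in LogAn-H.

```python
def all_substitutions(clause_body, example_atoms):
    """Find all substitutions mapping every body literal onto a true atom,
    by extending a table of partial substitutions one literal at a time."""
    table = [{}]  # start with the single empty substitution
    for pred, *variables in clause_body:
        new_table = []
        for theta in table:
            for atom in example_atoms:
                if atom[0] != pred or len(atom) - 1 != len(variables):
                    continue
                extended = dict(theta)
                ok = True
                for var, value in zip(variables, atom[1:]):
                    # Bind the variable, or check it against its earlier binding.
                    if extended.setdefault(var, value) != value:
                        ok = False
                        break
                if ok:
                    new_table.append(extended)
        table = new_table   # may grow large: this is the memory risk
    return table

body = [("p", "x1", "x2"), ("p", "x2", "x3")]
atoms = [("p", 1, 2), ("p", 2, 3), ("p", 3, 1), ("q", 1, 3)]
subs = all_substitutions(body, atoms)
print(len(subs))
```

The intermediate `table` holds every partial substitution at once, which is what makes the method fast when tables are small and memory-hungry when they are not.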

In addition the system includes new heuristics and facilities including
discretization of real valued arguments and pruning of rules. Both introduce
interesting issues for bottom up learning which do not exist for top down systems.
These are explored experimentally and discussed in the body of the paper.

The performance of the system is demonstrated in three domains: the Bongard
domain [7,13], the KRK-illegal domain [25], and the Mutagenesis domain
[29]. All these have been used before with other ILP systems. The results show
that our system is competitive with previous approaches while applying a
completely different algorithmic approach. This suggests that bottom up approaches
can indeed be used in large applications. The results and directions for future
work are further discussed in the concluding section of the paper.



2 Learning from Interpretations 

We briefly recall the setup in Learning from Interpretations [ 6 ] and introduce 
a running example which will help explain the algorithm. The task is to learn 
a universally quantified function-free Horn expression, that is, a conjunction of 
Horn clauses. The learning problem involves a finite set of predicates whose 
signature, i.e. names and arities, are fixed in advance. In the examples that 
follow we assume two predicates p() and q() both of arity 2. For example, C\ = 
Vaq, Vx 2 , Va’ 3 , [p(x 1 , 2 : 2 ) Ap(i 2 d 3 ) — l ► p(x 1 , 2 : 3 )] is a clause in the language. An 
example is an interpretation listing a domain of elements and the extension of 
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predicates over them. The example ei=([l,2,3], [[p(l, 2 ),p( 2 , 3),p(3, 1), g(l, 3)]]) 
describes an interpretation with domain [1,2,3] and where the four atoms listed 
are true in the interpretation and other atoms are false. The size of an example 
is the number of atoms true in it, so that s*2e(ei)=4. The example ei falsifies 
the clause above (substitute {1/xi,2/x 2 , 3/2:3}), so it is a negative example. On 
the other hand e2=([a, b, c, d], [[p(a, b),p(b, c),p(a, c),p(a, d),q(a , c)]]) is a positive 
example. We use standard notation ei Ci and e2 |= ci for these facts. The 
system (the batch algorithm of [13]) performs the standard supervised learning 
task: given a set of positive and negative examples it produces a Horn expression 
as its output. 
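
The labeling of the two running examples can be checked mechanically. The following Python sketch is an illustration only: the encoding of atoms as (predicate, arguments) tuples and the names `substitute` and `falsifies` are assumptions made here, and the brute-force enumeration of substitutions is not how the system itself works.

```python
from itertools import product

def substitute(atom, theta):
    """Apply a variable-to-object mapping to an atom (pred, args)."""
    pred, args = atom
    return (pred, tuple(theta[a] for a in args))

def falsifies(example, antecedent, consequent):
    """Return a substitution under which the example makes the antecedent
    true and the consequent false, or None if the clause is satisfied."""
    domain, atoms = example
    variables = sorted({v for _, args in antecedent + [consequent] for v in args})
    # Brute force over all assignments of domain objects to variables.
    for values in product(domain, repeat=len(variables)):
        theta = dict(zip(variables, values))
        if all(substitute(a, theta) in atoms for a in antecedent) \
                and substitute(consequent, theta) not in atoms:
            return theta
    return None

# c1: p(x1,x2) and p(x2,x3) -> p(x1,x3)
antecedent = [("p", ("x1", "x2")), ("p", ("x2", "x3"))]
consequent = ("p", ("x1", "x3"))

e1 = ([1, 2, 3], {("p", (1, 2)), ("p", (2, 3)), ("p", (3, 1)), ("q", (1, 3))})
e2 = (["a", "b", "c", "d"],
      {("p", ("a", "b")), ("p", ("b", "c")), ("p", ("a", "c")),
       ("p", ("a", "d")), ("q", ("a", "c"))})

print(falsifies(e1, antecedent, consequent))  # a falsifying substitution: e1 is negative
print(falsifies(e2, antecedent, consequent))  # None: e2 is positive
```

The first substitution found for $e_1$ is the one quoted in the text, $\{1/x_1, 2/x_2, 3/x_3\}$.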



3 The Base System 

We first review the basic features of the algorithm and system as described 
in [13]. The algorithm works by constructing an intermediate hypothesis and 
repeatedly refining it until the hypothesis is consistent with the data. The algo- 
rithm’s starting point is the most specific clause set that “covers” a particular 
negative example. A clause set is a set of Horn clauses that have the same an- 
tecedent but different conclusions. In this paper, we use [s, c] and variations of 
it to denote a clause set, where s is a set of atoms (the antecedent) and c is a 
set of atoms (each being a consequent of a clause in the clause set). Once the 
system has some such clauses it searches the dataset for a misclassified example. 
Upon finding one (it is guaranteed to be a negative example) the system tries 
to refine one of its clause sets using a generalization operation which we call 
pairing. Pairing is an operation akin to LGG [24] but it controls the size of the 
hypothesis by using a restriction imposed by a one to one object correspondence. 
If pairing succeeds, that is, the refinement is found to be good, the algorithm 
restarts the search for misclassified examples. If pairing did not produce a good 
clause, the system adds a new most specific clause set to the hypothesis. This 
process of refinements continues until no more examples are misclassified. 

To perform the above the system needs to refer to the dataset in order to 
evaluate whether the result of refinements, a proposed clause set, is useful or not. 
This is performed by an operation we call one-pass. In addition the algorithm 
uses an initial “minimization” stage where candidate clause sets are reduced in 
size. The high level structure of the algorithm is given in Figure 1. We proceed
with details of the various operations as required by the algorithm. 

Candidate clauses: For an interpretation I, rel-ant(I) is a conjunction of
positive literals obtained by listing all atoms true in I and replacing each object
in I with a distinct variable. So
rel-ant($e_1$) $= p(x_1,x_2) \wedge p(x_2,x_3) \wedge p(x_3,x_1) \wedge q(x_1,x_3)$.
Let X be the set of variables corresponding to the domain of I in this
transformation. The set of candidate clauses rel-cands(I) includes clauses of the
form (rel-ant(I) $\rightarrow p(Y)$), where p is a predicate, Y is a tuple of variables from X
of the appropriate arity and p(Y) is not in rel-ant(I). For example, rel-cands($e_1$)
includes among others the clauses
$[p(x_1,x_2) \wedge p(x_2,x_3) \wedge p(x_3,x_1) \wedge q(x_1,x_3) \rightarrow p(x_2,x_2)]$ and
$[p(x_1,x_2) \wedge p(x_2,x_3) \wedge p(x_3,x_1) \wedge q(x_1,x_3) \rightarrow q(x_3,x_1)]$, where
all variables are universally quantified. Note that there is a 1-1 correspondence
between a ground clause set [s, c] and its variabilized versions. We refer to the
variabilization using variabilize(·). In the following we just use [s, c] with the
implicit understanding that the appropriate version is used.

1. Initialize S to be the empty sequence.
2. Repeat until H is correct on all examples in E.
   a) Let H = variabilize(S).
   b) If H misclassifies I (I is negative but $I \models H$):
      i.   [s, c] = one-pass(rel-cands(I)).
      ii.  [s, c] = minimize-objects([s, c]).
      iii. For i = 1 to m (where $S = ([s_1, c_1], \ldots, [s_m, c_m])$)
             For every pairing J of $s_i$ and s
               If J's size is smaller than $s_i$'s size then
                 let [s, c] = one-pass($[J, c_i \cup (s_i \setminus J)]$).
                 If c is not empty then
                   A. Replace $[s_i, c_i]$ with [s, c].
                   B. Quit loop (Go to Step 2a)
      iv.  If no $s_i$ was replaced then add [s, c] as the last element of S.

Fig. 1. The Learning Algorithm (Input: example set E)

As described above, any predicate in the signature can be used as a conse- 
quent by the system. However, in specific domains the user often knows which 
predicates should appear as consequents. To match this, the system allows the 
user to specify which predicates are allowed as consequents of clauses. Natu- 
rally, this improves run time by avoiding the generation, validation and deletion 
of useless clauses. 
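
As a rough illustration of how candidate clauses can be enumerated, the Python sketch below builds rel-ant and rel-cands for the running example $e_1$. The tuple encoding of atoms, the variable names `x1, x2, ...`, and the `allowed` parameter (standing in for the user's list of permitted consequent predicates) are assumptions made for this sketch.

```python
from itertools import product

def rel_ant(example):
    """Variabilize an interpretation: one distinct variable per object."""
    domain, atoms = example
    var = {obj: f"x{i + 1}" for i, obj in enumerate(domain)}
    return sorted((pred, tuple(var[a] for a in args)) for pred, args in atoms)

def rel_cands(example, signature, allowed=None):
    """Return (antecedent, consequents). A consequent is any atom over the
    variables that is not already in the antecedent; `allowed` optionally
    restricts the consequent predicates (hypothetical parameter)."""
    domain, _ = example
    antecedent = rel_ant(example)
    variables = [f"x{i + 1}" for i in range(len(domain))]
    consequents = []
    for pred, arity in signature.items():
        if allowed is not None and pred not in allowed:
            continue
        for args in product(variables, repeat=arity):
            if (pred, args) not in antecedent:
                consequents.append((pred, args))
    return antecedent, consequents

e1 = ([1, 2, 3], {("p", (1, 2)), ("p", (2, 3)), ("p", (3, 1)), ("q", (1, 3))})
s, c = rel_cands(e1, {"p": 2, "q": 2})
print(len(s), len(c))            # 4 antecedent atoms, 14 candidate consequents
s_q, c_q = rel_cands(e1, {"p": 2, "q": 2}, allowed={"q"})
print(len(c_q))                  # 8 candidates when only q may be a consequent
```

With three variables and two binary predicates there are 18 possible atoms, 4 of which are already in the antecedent, which is why restricting the consequent predicates shrinks the candidate set noticeably.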

The one-pass procedure: Given a clause set [s, c] one-pass tests clauses in 
[s, c] against all positive examples in E. The basic observation is that if a positive 
example can be matched to the antecedent but one of the consequents is false 
in the example under this matching then this consequent is wrong. For each 
example e, the procedure one-pass removes all wrong consequents identified by 
e from c. If c is empty at any point then the process stops and $[s, \emptyset]$ is returned.
At the end of one-pass, each consequent is correct w.r.t. the dataset. 

This operation is at the heart of the algorithm since the hypothesis and 
candidate clause sets are repeatedly evaluated against the dataset. Two points 
are worth noting here. First, once we match the antecedent we can test all the 
consequents simultaneously so it is better to keep clause sets together rather 
than split them into individual clauses. Second, notice that since we must verify 
that consequents are correct, it is not enough to find just one substitution from 
an example to the antecedent. Rather we must check all such substitutions before 
declaring that some consequents are not contradicted. This is an issue that affects 
the implementation and will be discussed further below. 
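
The one-pass idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: atoms are (predicate, arguments) tuples, every consequent variable also occurs in the antecedent, and substitutions are enumerated by naive backtracking rather than by the techniques discussed later in the paper.

```python
def all_substitutions(antecedent, atoms):
    """Enumerate every mapping of variables to objects that embeds the
    antecedent in the set of true atoms (simple backtracking)."""
    def extend(i, theta):
        if i == len(antecedent):
            yield dict(theta)
            return
        pred, args = antecedent[i]
        for fact_pred, fact_args in atoms:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            new = dict(theta)
            # setdefault binds a fresh variable or returns the existing binding
            if all(new.setdefault(v, o) == o for v, o in zip(args, fact_args)):
                yield from extend(i + 1, new)
    yield from extend(0, {})

def one_pass(antecedent, consequents, positives):
    """Remove every consequent contradicted by some positive example.
    Assumes each consequent variable also appears in the antecedent."""
    remaining = list(consequents)
    for _, atoms in positives:
        for theta in all_substitutions(antecedent, atoms):
            remaining = [(p, args) for p, args in remaining
                         if (p, tuple(theta[a] for a in args)) in atoms]
            if not remaining:
                return antecedent, []
    return antecedent, remaining

antecedent = [("p", ("x1", "x2")), ("p", ("x2", "x3"))]
consequents = [("p", ("x1", "x3")), ("q", ("x1", "x3")), ("p", ("x3", "x1"))]
e2 = (["a", "b", "c", "d"],
      {("p", ("a", "b")), ("p", ("b", "c")), ("p", ("a", "c")),
       ("p", ("a", "d")), ("q", ("a", "c"))})
e3 = ([1, 2, 3], {("p", (1, 2)), ("p", (2, 3)), ("p", (1, 3))})  # made-up positive example
print(one_pass(antecedent, consequents, [e2])[1])
print(one_pass(antecedent, consequents, [e2, e3])[1])
```

Note that all consequents are filtered together for each substitution, which mirrors the first point above about keeping clause sets together.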

Minimization: Minimization takes a clause set [s, c] and reduces its size so
that it includes as few objects as possible while still having at least one correct
consequent. This is done by “dropping objects”. For example, for
$[s, c] = [[p(1,2), p(2,3), p(3,1), q(1,3)], [p(2,2), q(3,1)]]$, we can drop object 1 and all
atoms using it to get $[s, c] = [[p(2,3)], [p(2,2)]]$. The system iteratively tries
to drop each domain element. In each iteration it drops an object to get $[s', c']$,
runs one-pass on $[s', c']$ to get $[s'', c'']$. If $c''$ is not empty it continues with it to
the next iteration (assigning $[s, c] \leftarrow [s'', c'']$); otherwise it continues with [s, c].

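
The dropping step is easy to reproduce on the example above. In the Python sketch below, `drop_object` and `minimize_objects` are illustrative names, the `one_pass` argument is any callable taking and returning a clause set, and `stub_one_pass` is a hypothetical stand-in used only to make the example runnable without a dataset.

```python
def drop_object(s, c, obj):
    """Remove every atom that mentions `obj` from both parts of [s, c]."""
    def keep(atoms):
        return [(p, args) for p, args in atoms if obj not in args]
    return keep(s), keep(c)

def minimize_objects(s, c, domain, one_pass):
    """Try to drop each object in turn; a drop is kept only if one-pass
    leaves at least one consequent."""
    for obj in domain:
        s1, c1 = drop_object(s, c, obj)
        s2, c2 = one_pass(s1, c1)
        if c2:
            s, c = s2, c2
    return s, c

s = [("p", (1, 2)), ("p", (2, 3)), ("p", (3, 1)), ("q", (1, 3))]
c = [("p", (2, 2)), ("q", (3, 1))]
print(drop_object(s, c, 1))

# Hypothetical stand-in for one-pass: keeps all consequents unless the
# antecedent became empty. A real one-pass would consult the dataset.
def stub_one_pass(s, c):
    return s, (c if s else [])

print(minimize_objects(s, c, [1, 2, 3], stub_one_pass))
```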
Pairing: The pairing operation combines two clause sets $[s_a, c_a]$ and $[s_b, c_b]$ to
create a new clause set $[s_p, c_p]$. When pairing we utilize an injective mapping from
the smaller domain to the larger one. The system first pairs the antecedents by
taking the intersection under the injective mapping (using names from $[s_a, c_a]$)
to produce a new antecedent J. The resulting clause set is
$[s_p, c_p] = [J, (c_a \cap c_b) \cup (s_a \setminus J)]$. To illustrate this, the following example shows the two original
clauses, a mapping and the resulting values of J and $[s_p, c_p]$.

- $[s_a, c_a] = [[p(1,2), p(2,3), p(3,1), q(1,3)], [p(2,2), q(3,1)]]$
- $[s_b, c_b] = [[p(a,b), p(b,c), p(a,c), p(a,d), q(a,c)], [q(c,a)]]$
- The mapping $\{1/a, 2/b, 3/c\}$
- $J = [p(1,2), p(2,3), q(1,3)]$
- $[s_p, c_p] = [[p(1,2), p(2,3), q(1,3)], [q(3,1), p(3,1)]]$

The clause set $[s_p, c_p]$ obtained by the pairing can be more general than the
original clause sets $[s_a, c_a]$ and $[s_b, c_b]$ since $s_p$ is contained in both $s_a$ and $s_b$
(under the injective mapping). Hence, the pairing operation can be intuitively
viewed as a generalization of both participating clause sets. However since we
modify the consequent, by dropping some atoms and adding other atoms (from
$s_a \setminus J$), this is not a pure generalization operation.

Clearly, any two examples have a pairing for each injective mapping of domain 
elements. The system reduces the number of pairings that are tested by using 
only “live pairings” where each object appears in the extension of at least one 
atom in J. The details are described in [13]. 
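
The pairing example above can be reproduced with a short Python sketch. The function names and the convention of passing the mapping from objects of $[s_b, c_b]$ to objects of $[s_a, c_a]$ (the inverse of the mapping as written in the example) are assumptions made here; the `is_live` check is one possible reading of the “live pairing” condition, and the enumeration of candidate mappings is omitted.

```python
def rename(atoms, mapping):
    """Rename objects; atoms touching an unmapped object are dropped."""
    out = set()
    for pred, args in atoms:
        if all(a in mapping for a in args):
            out.add((pred, tuple(mapping[a] for a in args)))
    return out

def pairing(sa, ca, sb, cb, b_to_a):
    """Pair [sa, ca] with [sb, cb]; the result uses names from [sa, ca]."""
    sb_renamed = rename(sb, b_to_a)
    cb_renamed = rename(cb, b_to_a)
    J = set(sa) & sb_renamed                       # common antecedent
    cp = (set(ca) & cb_renamed) | (set(sa) - J)    # (ca n cb) u (sa \ J)
    return J, cp

def is_live(J, objects):
    """True if every paired object occurs in at least one atom of J."""
    return all(any(o in args for _, args in J) for o in objects)

sa = [("p", (1, 2)), ("p", (2, 3)), ("p", (3, 1)), ("q", (1, 3))]
ca = [("p", (2, 2)), ("q", (3, 1))]
sb = [("p", ("a", "b")), ("p", ("b", "c")), ("p", ("a", "c")),
      ("p", ("a", "d")), ("q", ("a", "c"))]
cb = [("q", ("c", "a"))]

J, cp = pairing(sa, ca, sb, cb, {"a": 1, "b": 2, "c": 3})
print(sorted(J))
print(sorted(cp))
print(is_live(J, [1, 2, 3]))
```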

Caching: The operation of the algorithm may produce repeated calls to one-pass 
with the same antecedent since pairings of one clause set with several others may 
result in the same clause set. Thus it makes sense to cache the results of one-pass. 
We cache ground versions of the s part of the clause set if one-pass determined 
that no consequents are true for it. So in such a case we get an immediate answer 
in future calls. The details are similar to [13]. 

4 Performance Issues and Applicability 

While the results reported in [13] were encouraging several aspects precluded 
immediate application to real world data sets. 

The Subsumption Test: Perhaps the most important issue is run time which 
is dominated by the cost of subsumption tests and the one-pass procedure. It 
is well known that testing subsumption is NP-Hard [14] and therefore we do
not expect a solution in the general case. However it is useful to look at the 
crucial parameters. In general subsumption scales exponentially in the number 
of variables in the clauses but polynomially with the number of predicates [23].
The problem is made worse in our system because of the bottom-up nature of the
learning process¹. When we generate the most specific clause for an example,
the number of variables in the clause is the same as the number of objects 
in the example and this can be quite large in some domains. In some sense 
the minimization process tries to overcome this problem by removing as many 
objects as possible (this fact is used in [12] to prove good complexity bounds). 
However the minimization process itself runs one-pass and therefore forms a 
bottleneck. In addition, for one-pass it is important to find all substitutions 
between a clause and an example. Therefore, a normal subsumption test that 
checks for the existence of a substitution is not sufficient. For problems with 
highly redundant structure the number of substitutions can grow exponentially 
with the number of predicates so this can be prohibitively expensive. Thus an 
efficient solution for one-pass is crucial in applications with large examples.

Suitability of Datasets and Overfitting: The system starts with the most
specific clauses and then removes parts of them in the process of generaliza- 
tion. In this process, a subset of an interpretation that was a negative example 
becomes the s in the input to one-pass. If examples matching s exist in the 
dataset then we may get a correct answer from one-pass. In fact if a dataset 
is “downward closed”, that is all such subsets exist as examples in the dataset, 
the system will find the correct expression. Note that we only need such subsets 
which are positive examples to exist in the dataset and that it is also sufficient to 
have isomorphic embeddings of such subsets in other positive examples as long 
as wrong consequents are missing. Under these conditions all calls to one-pass 
correctly identify all consequents of the clause. Of course, this is a pretty strong 
requirement but as demonstrated by experiments in [13] having a sample from 
interpretations of different sizes can work very well. 

If this is not the case, e.g. in the challenge problems of [11] where there is a 
small number of large examples, then we are not likely to find positive examples 
matching subsets of the negative ones (at least in the initial stages of minimiza- 
tion) and this can lead to overfitting. This has been observed systematically in 
experiments in this domain. 

Using Examples from Normal ILP setting: In the normal ILP setting 
[20] one is given a database as background knowledge and examples are simple 
atoms. We transform these into a set of interpretations as follows (see also [5, 
12]). The background knowledge in the normal ILP setting can be typically 
partitioned into different subsets such that each subset affects a single example 
only. A similar effect is achieved for intensional background knowledge in the 
Progol system [18] by using mode declarations to limit antecedent structure. 
Given example b, we will denote BK(b) as the set of atoms in the background
knowledge that is relevant to b. In the normal ILP setting we have to find a
theory T s.t. $BK(b) \cup T \models b$ if b is a positive example, and $BK(b) \cup T \not\models b$ if b
is negative. Equivalently, T must be such that $T \models BK(b) \rightarrow b$ if b is positive
and $T \not\models BK(b) \rightarrow b$ if b is negative.

¹ The same is true for other bottom up systems; see for example the discussion in [26].

Space limitations preclude discussion of the general case so we consider only
the following special case: Several ILP domains are formalized using a consequent
of arity 1 where the argument is an object that identifies the example in the
background knowledge. Since we separate the examples into interpretations this
yields a consequent of arity 0 for LogAn-H. In this case, if b is a positive
example in the standard ILP setting then we can construct an interpretation
I = ([F], [BK(b)]) where F is the set of objects appearing in BK(b), and label
I as negative. If b is a negative example, we construct an interpretation I =
([F], [BK(b)]) and label it positive. For a single possible consequent of arity 0
this captures exactly the same information. As an example, suppose that in the
normal ILP setting, the clause $p(a,b) \wedge p(b,c) \rightarrow q()$ is labeled positive and the
clause $p(a,b) \rightarrow q()$ is labeled negative. Then, the transformed dataset contains:
$([a,b,c], [p(a,b), p(b,c)])^-$ and $([a,b], [p(a,b)])^+$.

In the case of zero arity consequents, the check whether a given clause C 
is satisfied by some interpretation I can be considerably simplified. Instead of 
checking all substitutions it suffices to check for existence of some substitution, 
since any such substitution will remove the single nullary consequent. This has 
important implications for the implementation. In addition, note that the pairing 
operation never moves new atoms into the consequent and is therefore a pure 
generalization operation in this case. 
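
The transformation and the simplified test for nullary consequents can be sketched as follows. This is an illustration under assumptions: the names `to_interpretation` and `matches` are invented here, labels are represented by the strings "+" and "-", and the existence test is a naive backtracking search that stops at the first substitution.

```python
def to_interpretation(bk_atoms, is_positive):
    """Turn BK(b) for one normal-ILP example into a labeled interpretation.
    Positive ILP examples become negative interpretations and vice versa."""
    domain = sorted({obj for _, args in bk_atoms for obj in args})
    label = "-" if is_positive else "+"
    return domain, list(bk_atoms), label

def matches(antecedent, atoms, theta=None):
    """Return True as soon as one substitution embeds the antecedent in
    `atoms`; with a single nullary consequent this is all that is needed."""
    theta = theta or {}
    if not antecedent:
        return True
    (pred, args), rest = antecedent[0], antecedent[1:]
    for fact_pred, fact_args in atoms:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        ext = dict(theta)
        if all(ext.setdefault(v, o) == o for v, o in zip(args, fact_args)) \
                and matches(rest, atoms, ext):
            return True
    return False

neg = to_interpretation([("p", ("a", "b")), ("p", ("b", "c"))], is_positive=True)
pos = to_interpretation([("p", ("a", "b"))], is_positive=False)
print(neg)
print(pos)

# The clause p(X,Y) and p(Y,Z) -> q() is falsified by the negative
# interpretation and satisfied by the positive one.
body = [("p", ("X", "Y")), ("p", ("Y", "Z"))]
print(matches(body, neg[1]), matches(body, pos[1]))
```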

5 Further Improvements 

5.1 The Subsumption Test 

Table Based Subsumption: While backtracking search (as done in Prolog) 
can find all substitutions without substantial space overhead, the time overhead 
can be large. Our system implements an alternative approach that constructs all 
substitutions simultaneously and stores them in memory. The system maintains a 
table of instantiations for each predicate in the examples. To compute all substi- 
tutions between an example and a clause the system repeatedly performs joins of 
these tables (in the database sense) to get a table of all substitutions. We first ini- 
tialize to an empty table of substitutions. Then for each predicate in the clause we 
pull the appropriate table from the example, and perform a join which matches 
the variables already instantiated in our intermediate table. Thus if the predicate 
in the clause does not introduce new variables the table size cannot grow. Other- 
wise the table can grow and repeated joins can lead to large tables. To illustrate 
this consider evaluating the clause $p(x_1,x_2), p(x_2,x_1), p(x_1,x_3), p(x_3,x_4)$ on an
example with extension $[p(a,b), p(a,c), p(a,d), p(b,a), p(d,c)]$. Applying the
joins from left to right we get a sequence of partial substitution tables, one
after each atom of the clause.

Notice how the first application simply copies the table from the extension of
the predicate in the example. The first join reduces the size of the intermediate
table. The next join expands both lines. The last join drops the row with a b c
but expands other rows so that overall the table expands.
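
The intermediate tables for this example can be recomputed with a small Python sketch of the join-based procedure. The function names and the list-of-tuples table layout are assumptions made for illustration; the sketch starts from a table holding a single empty row, which plays the role of the initial empty table of substitutions.

```python
def join(table, columns, atom, facts):
    """Join a table of partial substitutions with the extension of one atom."""
    pred, args = atom
    new_columns = columns + [v for v in dict.fromkeys(args) if v not in columns]
    rows = []
    for row in table:
        theta = dict(zip(columns, row))
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            ext = dict(theta)
            # Bound variables must agree with the fact; new ones get bound.
            if all(ext.setdefault(v, o) == o for v, o in zip(args, fact_args)):
                rows.append(tuple(ext[v] for v in new_columns))
    return rows, new_columns

def substitution_tables(clause, facts):
    """Return the intermediate table after each join, left to right."""
    table, columns, history = [()], [], []
    for atom in clause:
        table, columns = join(table, columns, atom, facts)
        history.append(table)
    return history

facts = [("p", ("a", "b")), ("p", ("a", "c")), ("p", ("a", "d")),
         ("p", ("b", "a")), ("p", ("d", "c"))]
clause = [("p", ("x1", "x2")), ("p", ("x2", "x1")),
          ("p", ("x1", "x3")), ("p", ("x3", "x4"))]

history = substitution_tables(clause, facts)
for table in history:
    print(len(table), table)
```

The table sizes are 5, 2, 4 and 5, matching the description: a copy of the extension, a reduction, an expansion, and a final step in which the row a b c is dropped while the remaining rows expand.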

One can easily construct examples where the table in intermediate steps is 
larger than the memory capacity of the computer, sometimes even if the final 
table is small. In this case the matching procedure will fail. If it does not fail, 
however, the procedure can be fast since we have no backtracking overhead and 
we consider many constraints simultaneously. 

Nonetheless this is not something we can ignore. We have observed such 
large table sizes in the mutagenesis domain [29] as well as the artificial challenge 
problems of [11]. Note that a backtracking search will not crash in this case 
but on the other hand it may just take too long computationally so it is not 
necessarily a good approach (this was observed in the implementation of [13]).

Lookahead: As in the case of database queries one can try to order the joins in
order to optimize the computation time and space requirements. Our system can 
perform a form of one step lookahead by estimating the size of the table when 
using a join with each of the atoms on the clause and choosing the minimal 
one. This introduces a tradeoff in run time. On one hand the resulting tables in 
intermediate steps tend to be smaller and therefore there is less information to 
process and the test is quicker. On the other hand the cost of one step lookahead 
is not negligible so it can slow down the program. The behavior depends on the 
dataset in question. In general however it can allow us to solve problems which 
are otherwise unsolvable with the basic approach.
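
A greedy one step lookahead can be sketched as below. This is a simplification and not the system's implementation: instead of estimating the size of each candidate join, the sketch performs every candidate join and measures it exactly, which shows the ordering effect but not the cost tradeoff. All names are illustrative.

```python
def join(rows, atom, facts):
    """Extend each partial substitution (a dict) with matches for `atom`."""
    pred, args = atom
    out = []
    for theta in rows:
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            ext = dict(theta)
            if all(ext.setdefault(v, o) == o for v, o in zip(args, fact_args)):
                out.append(ext)
    return out

def lookahead_order(clause, facts):
    """Greedy one step lookahead: always join the atom that yields the
    smallest table (ties broken by position in the clause)."""
    rows, remaining, order, sizes = [{}], list(clause), [], []
    while remaining:
        candidates = [(len(join(rows, atom, facts)), i)
                      for i, atom in enumerate(remaining)]
        _, best = min(candidates)
        atom = remaining.pop(best)
        rows = join(rows, atom, facts)
        order.append(atom)
        sizes.append(len(rows))
    return order, sizes, rows

facts = [("p", ("a", "b")), ("p", ("a", "c")), ("p", ("a", "d")),
         ("p", ("b", "a")), ("p", ("d", "c"))]
# Same atoms as in the example above, but listed in an unfavorable order.
clause = [("p", ("x3", "x4")), ("p", ("x1", "x3")),
          ("p", ("x1", "x2")), ("p", ("x2", "x1"))]

order, sizes, rows = lookahead_order(clause, facts)
print(order)
print(sizes)
```

On this input, joining in the given order produces an intermediate table of 9 rows, while the greedy order never exceeds 5 rows; both end with the same 5 substitutions.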

Randomized Table Based Subsumption: If the greedy solution is still not 
sufficient or too slow we can resort to randomized subsumption tests. Instead 
of finding all substitutions we try to sample from the set of legal substitutions. 
This is done in the following manner: if the size of the intermediate table grows 
beyond a threshold parameter ‘TH’ (controlled by the user), then we throw 
away a random subset of the rows before continuing with the join operations. 
The maximum size of intermediate tables is TH×16. In this way we are not
performing a completely random choice over possible substitutions. Instead we
are informing the choice by our intermediate table. In addition the system uses
random restarts to improve confidence as well as to allow more substitutions to
be found; this can be controlled by the user through a parameter ‘R’.
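
The sampling idea can be illustrated with the following Python sketch. It is only a rough approximation under stated assumptions: the table is cut back to `th` rows after any join that exceeds `th`, the TH×16 bound on intermediate tables is not modeled, and the `seed` argument is added here purely to make runs repeatable. The names `th` and `restarts` stand in for the parameters TH and R.

```python
import random

def join(rows, atom, facts):
    """Extend each partial substitution (a dict) with matches for `atom`."""
    pred, args = atom
    out = []
    for theta in rows:
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            ext = dict(theta)
            if all(ext.setdefault(v, o) == o for v, o in zip(args, fact_args)):
                out.append(ext)
    return out

def randomized_substitutions(clause, facts, th, restarts=1, seed=0):
    """Sample substitutions: whenever an intermediate table exceeds `th`
    rows, keep a random subset of `th` rows and continue joining."""
    rng = random.Random(seed)
    found = set()
    for _ in range(restarts):
        rows = [{}]
        for atom in clause:
            rows = join(rows, atom, facts)
            if len(rows) > th:
                rows = rng.sample(rows, th)
        found.update(tuple(sorted(r.items())) for r in rows)
    return found

facts = [("p", ("a", "b")), ("p", ("a", "c")), ("p", ("a", "d")),
         ("p", ("b", "a")), ("p", ("d", "c"))]
clause = [("p", ("x1", "x2")), ("p", ("x2", "x1")),
          ("p", ("x1", "x3")), ("p", ("x3", "x4"))]

full = randomized_substitutions(clause, facts, th=100)
small = randomized_substitutions(clause, facts, th=1, restarts=3)
print(len(full), len(small))
```

Every substitution returned is a genuine one, so sampling can only miss substitutions, never invent them; a very small threshold may find none at all, which is why restarts help.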

Using Django: The system Django [16] uses ideas from constraint satisfaction 
to solve the subsumption problem. Django only solves the existence question 
and does not give all substitutions, but as discussed above this is sufficient for 
certain applications coming from the standard ILP setting. We have integrated 
the Django code, generously provided by Jerome Maloberti, as a module in our 
system. 

5.2 Discretization 

The system includes a capability for handling numerical data by means of discretization.
Several approaches to discretization have been proposed in the literature
[10,9,2]. We have implemented the simple “equal frequency” approach
that generates a given number of bins (specified by the user) and assigns the 
boundaries by giving each bin the same number of occurrences of values. 

To do this for relational data we first divide the numerical attributes into 
“logical groups”. For example the rows of a chess board will belong to the 
same group regardless of the predicate and argument in which they appear. 
This generalizes the basic setup where each argument of each predicate is dis- 
cretized separately. The dataset is annotated to reflect this grouping and the 
preferred number of thresholds is specified by the user. The system then de- 
termines the threshold values, allocates the same name to all objects in each 
range, and adds predicates reflecting the relation of the value to the thresholds. 
For example, discretizing the logp attribute in the mutagenesis domain with 4
thresholds (5 ranges), a value between threshold 1 and threshold 2 will yield:
[logp(logp_val.02), logp_val>00(logp_val.02), logp_val>01(logp_val.02),
logp_val<02(logp_val.02), logp_val<03(logp_val.02), ...].

Notice that we are using both > and < predicates so that the hypothesis can 
encode intervals of values. 
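
A minimal Python sketch of equal frequency discretization and of the resulting threshold atoms is given below. The choice of cut points as midpoints between neighboring sorted values, the handling of values equal to a threshold, and the function names are assumptions made here; only the naming pattern of the atoms follows the logp example above.

```python
def equal_frequency_thresholds(values, bins):
    """Return bins-1 cut points so that each bin holds roughly the same
    number of observed values. Assumes len(values) >= bins."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, bins):
        i = (k * n) // bins
        cuts.append((ordered[i - 1] + ordered[i]) / 2)  # midpoint between neighbors
    return cuts

def threshold_atoms(name, value, thresholds):
    """Encode one numeric value as a range object plus > and < atoms."""
    index = sum(value > t for t in thresholds)   # range the value falls in
    obj = f"{name}_val.{index:02d}"
    atoms = [f"{name}({obj})"]
    atoms += [f"{name}_val>{k:02d}({obj})"
              for k, t in enumerate(thresholds) if value > t]
    atoms += [f"{name}_val<{k:02d}({obj})"
              for k, t in enumerate(thresholds) if value < t]
    return atoms

print(equal_frequency_thresholds(range(10), 5))
print(threshold_atoms("logp", 2.5, [1.0, 2.0, 3.0, 4.0]))
```

With four thresholds numbered 00 to 03, a value lying between thresholds 01 and 02 produces the same five atoms as in the example.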

An interesting aspect arises when using discretization which highlights the 
way our system works. Recall that the system starts with an example and es- 
sentially turns objects into variables in the maximally specific clause set. It then 
evaluates this clause on other examples. Since we do not expect examples to 
be identical or very close, the above relies on the universal quantification to al- 
low matching one structure into another. However, the effect of discretization 
is to ground the value of the discretized object. For example, if we discretized
the logp attribute from above and variabilize we get logp(X), logp_val>00(X),
logp_val>01(X), logp_val<02(X), logp_val<03(X). Thus unless we drop some of the
boundary constraints this limits matching examples to have a value in the same 
bin. We are therefore losing the power of universal quantification. As a result 
fewer positive examples will match in the early stages of the minimization pro- 
cess, fewer consequents will be removed, and the system may be led to overfitting 
by dropping the wrong objects. This is discussed further in the experimental sec- 
tion. 

5.3 Pruning 

The system performs bottom-up search and may stop with relatively long rules 
if the data is not sufficiently rich (i.e. we do not have enough negative examples) 
to warrant further refinement of the rules. Pruning allows us to drop additional 
parts of rules. The system can perform a greedy reduced error pruning [17] 
using a validation dataset. For each atom in the rule (in some order) the system 
evaluates whether the removal of the atom increases the error on the validation 
set. If not the atom can be removed. While it is natural to allow an increase in 
error using a tradeoff against the length of the hypothesis in an MDL fashion, 
we have not yet experimented with this possibility. 
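
The greedy pruning loop can be sketched as follows. The `error` callable is a hypothetical placeholder for evaluating a rule on the chosen dataset (validation or training), and the toy error function in the usage lines exists only to make the sketch runnable; neither is part of the system.

```python
def prune_rule(antecedent, error):
    """Greedy reduced error pruning over the atoms of one rule.
    `error(atoms)` returns the error of the rule with that antecedent."""
    current = list(antecedent)
    best = error(current)
    for atom in list(antecedent):
        candidate = [a for a in current if a != atom]
        e = error(candidate)
        if e <= best:              # removal does not increase the error
            current, best = candidate, e
    return current

# Toy error function: the rule is correct only while it keeps atoms a and b.
def toy_error(atoms):
    return 0 if {"a", "b"} <= set(atoms) else 1

print(prune_rule(["a", "b", "c", "d"], toy_error))
```

Replacing the test `e <= best` by a comparison that also rewards shorter rules would give the MDL-style tradeoff mentioned above.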

Notice that unlike top down systems we can perform this pruning on the 
training set and do not necessarily need a separate validation set. In a top down 
system one grows the rules until they are consistent with the data. Thus, any
pruning will lead to an increase in training set error. On the other hand in 
a bottom up system, pruning is similar to the main stage of the algorithm in 
that it further generalizes the rules. In some sense, pruning on the training set 
allows us to move from a most specific hypothesis to a most general hypothesis 
that matches the data. Both training set pruning and validation set pruning are 
possible with our system. 

5.4 Consistency Checks 

If the input dataset is inconsistent, step (i) of the algorithm may produce an 
initial version of the most specific clause set with an empty list of consequents.
Similar problems may arise with the randomized subsumption tests. The sys- 
tem includes simple mechanisms for ignoring such examples once a problem is 
detected. 

6 Experiments 

The LogAn-H system of [13] implements the algorithm in Section 3 using Prolog 
and its backtracking search engine. Our new system includes a C implementation 
of the ideas described above. 

6.1 Bongard Problems 

To illustrate the improvement in efficiency of the new system w.r.t. the previous 
implementation, we re-ran experiments done with artificial data akin to Bongard 
problems [13]. This domain was introduced previously in the ICL system [7]. In 
this domain an example is a “picture” composed of objects of various shapes 
(triangle, circle or square), triangles have a configuration (up or down) and each 
object has a color (black or white). Each picture has several objects (the number 
is not fixed) and some objects are inside other objects. For these experiments we 
generated random examples, where each parameter in each example was chosen 
uniformly at random. In particular we used between 2 and 6 objects, the shape 
color and configuration were chosen randomly, and each object is inside some 
other object with probability 0.5 where the target was chosen randomly among 
“previous” objects to avoid cycles. Note that since we use a “flattened” function 
free representation the domain size in examples is larger than the number of 
objects (to include: up, down, black, white). We generated (by hand) a target
Horn expression of 10 clauses, with 9 atoms and 6 variables each. We used this
Horn expression to label the examples. For example, one of the clauses generated
in the target expression is
circle(X) in(X,Y) in(Y,Z) colour(Y,B)
colour(Z,W) black(B) white(W) in(Z,U) → triangle(Y)

Work in [13] showed that LogAn-H gives good performance on this domain
and that it outperforms ICL. Running the experiments with the new system
we obtain exactly the same accuracy as before² and the speedup observed is
between one and three orders of magnitude over the Prolog system in compiled
Sicstus Prolog (which is a fast implementation) when run on the same hardware.



6.2 Illegal Positions in Chess 

Our next experiment is in the domain of the chess endgame White King and 
Rook versus Black King. The task is to predict whether a given board configu- 
ration represented by the 6 coordinates of the three chess pieces is illegal or not. 
This learning problem has been studied by several authors [22,25]. The dataset 
includes a training set of 10000 examples and a test set of the same size. 

We use the predicate position(a,b,c,d,e,f) to denote that the White 
King is in position (a, b) on the chess board, the White Rook is in position (c, d), 
and the Black King in position (e,f). Additionally, the predicates “less-than” 
lt(x,y) and “adjacent” adj(x,y) denote the relative positions of rows and 
columns on the board. Note that there is an interesting question as how best to 
capture examples in interpretations. In the “all background mode” we include all 
lt and adj predicates in the interpretation. In the “relevant background mode”
we only include those atoms directly relating objects appearing in the head. 

We illustrate the difference with the following example. Consider the config- 
uration “White King is in position (7,6), White Rook is in position (5,0), Black 
King is in position (4,1)” which is illegal. In “all background mode” we use the 
following interpretation: 

[position(7,6,5,0,4,1),
 lt(0,1), lt(0,2), .., lt(0,7),
 lt(1,2), lt(1,3), .., lt(1,7),
 ..
 lt(5,6), lt(5,7),
 lt(6,7),
 adj(0,1), adj(1,2), .., adj(6,7),
 adj(7,6), adj(6,5), .., adj(1,0)]-

When considering the “relevant background mode”, we include in the examples
instantiations of lt and adj whose arguments appear in the position atom
directly:

[position(7,6,5,0,4,1),
 lt(4,5), lt(4,7), lt(5,7), adj(4,5), adj(5,4),
 lt(0,1), lt(0,6), lt(1,6), adj(0,1), adj(1,0)]-

Table 1 includes results of running our system in both modes. We trained
LogAn-H on samples with various sizes chosen randomly among the 10000
available. We report accuracies that result from averaging among 10 runs over
an independent test set of 10000 examples. Results are reported before and after
pruning where pruning is done using the training set. Several facts can be
observed in the table. First, we get good learning curves with accuracies improving
with training set size. Second, the results obtained are competitive with
the results reported for FOIL [25]. Third, relevant background knowledge seems
to make the task easier. Fourth, pruning considerably improves performance on
this dataset especially for small training sets.

Table 1. Performance summary for KRK-illegal dataset

Training set size               25     50     75    100    200    500   1000   2000   3000
w/o disc., rel. back. mode:
  LogAn-H before pruning     75.49  88.43  93.01  94.08  97.18  99.54  99.79  99.92  99.96
  LogAn-H after pruning      86.52  90.92  94.19  95.52  98.41  99.65  99.79  99.87  99.96
w/o disc., all back. mode:
  LogAn-H before pruning     67.18  71.08  75.71  78.94  85.56  94.06  98.10  99.38  99.56
  LogAn-H after pruning      79.01  81.65  83.17  82.82  86.02  93.67  96.24  98.10  98.66
with disc., rel. back. mode:
  LogAn-H before pruning     43.32  43.70  45.05  44.60  52.39  72.26  84.80  90.30  92.17
  LogAn-H after pruning      38.93  42.77  46.46  47.51  56.59  74.29  85.02  90.73  92.59
with disc., all back. mode:
  LogAn-H before pruning     67.27  72.69  75.15  78.00  82.68  88.60  91.03  91.81  92.01
  LogAn-H after pruning      80.62  86.14  87.42  89.10  90.67  92.25  92.62  92.66  92.74
FOIL [25]                                         92.50                99.40

² Note that the hypothesis may depend on the order of pairings produced so in principle
the results are not guaranteed to be identical.

Our second set of experiments in this domain illustrates the effect of dis- 
cretization. We have run the same experiments as above but this time with 
the discretization option turned on. Concretely, given an example’s predicate
position(x1,x2,y1,y2,z1,z2), we consider the three values corresponding to
columns (x1,y1,z1) as the same logical attribute and therefore we discretize
them together. Similarly, we discretize the values of (x2,y2,z2) together. Versions
of adj() for both column and row values are used. We do not include lt()
predicates since these are essentially now represented by the threshold predi- 
cates produced by the discretization. As can be seen in Table 1 good accuracy is 
maintained with discretization. However, an interesting point is that now “rele- 
vant background mode” performs much worse than “all background mode”. In 
hindsight one can see that this is a result of the grounding effect of discretizing 
as discussed above. With “relevant background mode” the discretization thresh- 
old predicates and the adjacent predicates are different in every example. Since, 
as explained above, the examples are essentially ground we expect fewer matches
between different examples and thus the system is likely to overfit. With “all 
background mode” these predicates do not constrain the matching of examples. 

This domain is also a good case to illustrate the various subsumption tests
in our system. Note that since we put the position predicate in the antecedent
the consequent is nullary. Therefore we can use Django as well as the table
based subsumption and randomized tables. The comparison is given for the
non-discretized “all background mode” with 1000 training examples. Table 2
gives accuracy and run time (on Linux running with Pentium IV 2.80 GHz) for
various subsumption settings averaged over 10 independent runs. For randomized
runs TH is the threshold of table size after which sampling is used. As can be
seen, the table based method is faster than Django (both are deterministic
and thus give identical hypotheses and accuracy results). The lookahead table
method incurs some overhead and results in slower execution on this domain;
however, it saves space considerably (see third column of Table 2). Caching gives
a reduction of about 60% in run time. Running the randomized test with very
small tables (TH=1) clearly leads to overfitting, and in this case increases run
time considerably, mainly due to the large number of rules induced. On the other
hand with fairly small table sizes (TH=1000) the randomized method does very
well and reproduces the deterministic results.

Table 2. Runtime comparison for subsumption tests on KRK-illegal dataset

Subsumption Engine   runtime in s.   accuracy   actual table size
Django                       431.6     98.11%
Tables                        19.2     98.11%              130928
Lookahead                     25.4     98.11%               33530
No cache                      49.4     98.11%
Rand. TH=1                   741.7     33.61%                  16
Rand. TH=10                   30.7     33.61%                 160
Rand. TH=100                  12.4     72.05%                1600
Rand. TH=1000                 20.3     98.11%               16000



6.3 Mutagenesis 

The Mutagenesis dataset is a structure-activity prediction task for molecules 
introduced by [29]. The dataset consists of 188 compounds, labeled as active or 
inactive depending on their level of mutagenic activity. The task is to predict 
whether a given molecule is active or not based on the first-order description of 
the molecule. This dataset has been partitioned into 10 subsets for 10-fold cross 
validation estimates and has been used in this form in many studies (e.g. [29,26, 
7]). We therefore use the same partitions as well. Each example is represented as 
a set of first-order atoms that reflect the atom-bond relation of the compounds 
as well as some interesting global chemical properties. Concretely, we use all the 
information corresponding to the background level B3 of [28]. Notice that the 
original data is given in the normal ILP setting and hence we transformed it as 
described above using a single nullary consequent. In addition, since constants 
are meaningful in this dataset (for example whether an atom is a carbon or 
oxygen) we use a flattened version of the data where we add a predicate for each 
such constant. 

This example representation uses continuous attributes (atom-charge, lumo
and logp in particular), hence discretization is needed. Although the
discretization process is fully automated, it requires the number of discrete
categories to be specified by the user. Here, we use a method that allows us to
determine this number automatically and without any use of the test set: for
each partition of the cross validation, we split the training data into two
random sets, one which we call disc-train and consists of 80% of the training
data, and another called disc-test which consists of the remaining training data.
Then, for each of the possible values (atom-charge = 5, 15, 25, 35, 45;
lumo = 4, 6, 8, 10; logp = 4, 6, 8, 10) we train and test over the sets
disc-train and disc-test. This procedure is repeated 5 times and we choose the
discretization values that obtain the best average accuracy on this partition.
Note that these values might be different for different partitions of the global
cross validation and indeed we did not get a stable choice. Once a set of values
is chosen for a particular partition of the data, the learning process is
performed over the entire training set and then it is tested on the
corresponding independent test set.

Table 3. Runtime comparison for subsumption tests on mutagenesis dataset

Subsumption Engine    runtime in s.    accuracy
Django                    1162          87.96%
Rand. TH=1                   3          85.52%
Rand. TH=10                 15          86.46%
Rand. TH=100                19          89.47%
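
The selection of discretization values described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the LogAn-H code: `train_and_score` is a hypothetical callback standing in for a full learner run, all combinations of the candidate values are tried (the text does not say whether attributes are tuned jointly or separately), and a fresh random 80/20 split is drawn for each of the 5 repetitions.

```python
import itertools
import random
from statistics import mean

# Candidate numbers of discrete categories, as listed in the text.
CANDIDATES = {
    "atom_charge": [5, 15, 25, 35, 45],
    "lumo": [4, 6, 8, 10],
    "logp": [4, 6, 8, 10],
}

def choose_discretization(training_set, train_and_score, repeats=5, frac=0.8, seed=0):
    """Return the setting with the best average accuracy over random splits.

    train_and_score(disc_train, disc_test, setting) -> accuracy is a
    hypothetical callback; the test fold of the outer cross validation is
    never passed in, so it plays no role in the choice.
    """
    rng = random.Random(seed)
    names = list(CANDIDATES)
    settings = [dict(zip(names, combo))
                for combo in itertools.product(*(CANDIDATES[n] for n in names))]
    scores = {i: [] for i in range(len(settings))}
    for _ in range(repeats):
        shuffled = list(training_set)
        rng.shuffle(shuffled)
        cut = int(frac * len(shuffled))          # 80% disc-train, 20% disc-test
        disc_train, disc_test = shuffled[:cut], shuffled[cut:]
        for i, setting in enumerate(settings):
            scores[i].append(train_and_score(disc_train, disc_test, setting))
    best = max(scores, key=lambda i: mean(scores[i]))
    return settings[best]
```

The chosen setting would then be used to train on the entire training set of the fold before evaluating on the held-out test fold.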

For this domain deterministic table-based subsumption was not possible,
not even with the lookahead heuristic, since the table size grew beyond the memory
capacity of our computer. However, here the Django subsumption engine yields
good run times. The average training time per fold, after the discretization values 
have been determined, is 14 min. (on Linux running with Pentium IV 2.80 GHz). 
Prediction accuracies obtained for each partition in this fashion are (in order 
from 1 to 10): 73.68%, 89.47%, 78.95%, 84.21%, 84.21%, 89.47%, 89.47%, 73.68%, 
73.68%, 88.24%, which results in a final average of 82.5%. Additionally, we ran 
a regular 10-fold cross-validation for each combination of discretization values.
The values atom-charge= 45, lumo= 10 and logp= 4 obtained the best average 
accuracy of 87.96%. Our results compare well to other ILP systems: PROGOL
[29] reports a total accuracy of 83% with B3 and 88% with B4; STILL [26]
reports results in the range 85%-88% on B3 depending on the values of various
tuning parameters; ICL [7] reports an accuracy of 84%; and finally [15] report
that FOIL [25] achieves an accuracy of 83%.

Here again we ran further experiments with the randomized subsumption 
tests. We used the discretization values atom-charge= 45, lumo= 10 and logp= 
4. Table 3 gives run time (on Linux running with Pentium IV 2.80 GHz) per
fold and the 10-fold cross validation accuracy with various parameters. One can
observe that even with small parameters the randomized methods do very well.
A comparison with the hypotheses produced by the deterministic runs with Django
shows that they are very similar.

6.4 Evaluating Randomized Subsumption Tests 

The experiments above already show that there are cases where the table based
method is faster than Django even though it searches for all substitutions
compared to just one in Django. On the other hand the table based method can be
slow in other cases and even run out of memory and fail. The following
experiments give simple synthetic examples where we compare the subsumption
tests on their own, without reference to the learning system, showing similar
behavior. In each case we generate a family of problems parametrized by size,
each having a single example and single clause. We run the subsumption test
1000 times to observe run time differences as well as accuracies for the
randomized methods.

Table 4. Subsumption run time in linear chain family

         Django        Tables                Lookahead           TH=1          TH=10         TH=100
         100.0% 296s   100.0% 242s (14161)   100.0% 318s (118)
R=1                                                              6.9% 13s      18.6% 49s     100.0% 240s
R=10                                                             32.2% 60s     66.6% 181s    100.0% 243s
R=100                                                            96.9% 185s    100.0% 280s   100.0% 241s

Table 5. Subsumption run time in subgraph isomorphism family

         Django         TH=1          TH=10          TH=100         TH=1000
         100.00% 7.1s
R=1                     0.01% 0.8s    3.21% 1.7s     31.71% 8.9s    85.46% 52.4s
R=10                    0.03% 2.6s    15.29% 7.0s    78.95% 26.8s   95.12% 76.1s
R=100                   0.22% 23.6s   67.88% 38.9s   99.92% 39.1s   99.97% 103.1s

For the first family both example and clause are chains of length n built using
a binary predicate as in $p(x_1, x_2), p(x_2, x_3), \ldots, p(x_{n-1}, x_n)$. Thus there is exactly
one matching substitution. Results for n = 120 are given in Table 4. As can be
seen, in this case tables are faster than Django, randomized tables work well 
with small parameters, and both table size and repeats (controlled by TH and R 
in Table 4) are effective in increasing the performance of the randomized tests. 
This behavior was observed consistently for different values of n. The numbers 
in parentheses are the actual table sizes needed by the table-based methods; the 
lookahead heuristic saves considerable space. 
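
The interplay of table size and repeats can be illustrated on the chain family with a small Python sketch. It rests on assumptions that are not taken from the system itself: atoms are tuples, variables are capitalised strings, the table of partial substitutions is extended one clause atom at a time, and once the table exceeds TH rows a uniform random sample of TH rows is kept. The names `subsumes`, `randomized_subsumes` and `chain` are hypothetical, and the sampling rule actually used by the system may differ.

```python
import random

def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def extend(subst, atom, fact):
    """Extend `subst` so that `atom` maps onto `fact`; None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    new = dict(subst)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if new.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return new

def subsumes(clause, example, th=None, rng=random):
    """Table-based test; with `th` set, oversized tables are randomly sampled."""
    table = [{}]
    for atom in clause:
        table = [s2 for s in table for fact in example
                 if (s2 := extend(s, atom, fact)) is not None]
        if not table:
            return False
        if th is not None and len(table) > th:
            table = rng.sample(table, th)
    return True

def randomized_subsumes(clause, example, th, repeats, rng=random):
    """Repeat the sampled test; a positive answer is always correct."""
    return any(subsumes(clause, example, th, rng) for _ in range(repeats))

def chain(n, prefix):
    """Chain p(t1,t2), p(t2,t3), ..., p(t(n-1),tn) over terms named prefix+i."""
    return [("p", f"{prefix}{i}", f"{prefix}{i + 1}") for i in range(1, n)]
```

With `clause = chain(n, "X")` and `example = chain(n, "c")`, the deterministic test (`th=None`) succeeds, whereas the sampled test can miss the single matching substitution; it never reports a match that does not exist, which is why more repeats or larger tables can only raise the reported accuracy.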

The second family is motivated by the mutagenesis domain and essentially
checks for subgraph isomorphism. The clause is a randomly generated graph
with n nodes and 3n edges, and the example is the same set plus 3n extra edges.
The results for n = 10 are given in Table 5. Deterministic tables fail for values of
n larger than 8 and are omitted. As can be seen, Django works very well in this
case and randomized tables work well even with small parameters, and both table
size and repeats (controlled by TH and R in Table 5) are effective in increasing the
performance of the randomized tests. Similar results were obtained for different
values of n, where randomized tables sometimes achieve high accuracy with lower
run times than Django, though in general Django is faster.



7 Discussion 

The paper presents a new implementation of the LogAn-H system including 
new subsumption engines, discretization and pruning. Interesting aspects of dis- 
cretization and pruning which are specific to bottom up search are discussed in 
the paper. The system is sufficiently strong to handle large ILP datasets and 
is shown to be competitive with other approaches while using a completely dif- 
ferent algorithmic approach. The paper also demonstrates the merits of having 
several subsumption engines at hand to fit properties of particular domains, and 
gives paradigmatic cases where different engines do better than others. 

As illustrated in [13] using the Bongard domain, LogAn-H is particularly 
suited to domains where substructures of examples in the dataset are likely to 
be in the dataset as well. On the other hand, for problems with a small number 
of examples where each example has a large number of objects and dramatically 
different structure our system is likely to overfit since there is little evidence for 
useful minimization steps. Indeed we found this to be the case for the artificial
challenge problems of [11] where our system outputs a large number of rules and 
gets low accuracy. Interestingly, a similar effect can result from discretization 
since it results in a form of grounding of the initial clauses and thus counteracts 
the fact that they are universally quantified and thus likely to be contradicted by 
the dataset if wrong. This suggests that skipping the minimization step may lead 
to improved performance in such cases if pairings reduce clause size considerably. 
Initial experiments with this are as yet inconclusive. 

Our experiments demonstrated the utility of informed randomized subsump- 
tion tests. Another interesting possibility is to follow ideas from the successful 
randomized propositional satisfiability tester WalkSat [27]. Here one can aban- 
don the table structure completely and search for a single substitution using a 
random walk over substitutions where in each step we modify an unsuccessful 
substitution to satisfy at least one more atom. Repeating the above can improve 
performance as well as find multiple substitutions when needed. Initial experi- 
ments suggest that this indeed can be useful albeit our current implementation 
is slow. It would be interesting to explore this further in LogAn-H and other 
systems. 
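
The random-walk idea can be sketched in Python as follows. This is a rough illustration under assumptions, not the implementation mentioned above: atoms are tuples, variables are capitalised strings, and each step picks a random unsatisfied atom and rebinds its variables so that it matches a randomly chosen fact (which may break other atoms, as in WalkSat-style search). The function name `random_walk_subsumes` and the `max_steps` parameter are hypothetical.

```python
import random

def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def match(atom, fact):
    """Bindings that map `atom` onto `fact`, or None if they cannot match."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    binding = {}
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if binding.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return binding

def satisfied(atom, subst, facts):
    ground = (atom[0],) + tuple(subst.get(t, t) for t in atom[1:])
    return ground in facts

def random_walk_subsumes(clause, example, max_steps=1000, seed=None):
    """Return a substitution mapping the clause into the example, or None."""
    rng = random.Random(seed)
    facts = set(example)
    objects = sorted({t for fact in example for t in fact[1:]})
    variables = sorted({t for atom in clause for t in atom[1:] if is_var(t)})
    if variables and not objects:
        return None
    subst = {v: rng.choice(objects) for v in variables}   # random start
    for _ in range(max_steps):
        unsat = [a for a in clause if not satisfied(a, subst, facts)]
        if not unsat:
            return subst
        atom = rng.choice(unsat)
        candidates = [b for f in example if (b := match(atom, f)) is not None]
        if not candidates:
            return None          # this atom can never be satisfied
        subst.update(rng.choice(candidates))
    return subst if all(satisfied(a, subst, facts) for a in clause) else None
```

A returned substitution is always a genuine witness of subsumption; a `None` answer after `max_steps` steps is inconclusive, so restarts with different seeds play the role of the repeats used for the randomized tables.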

Our system also demonstrates that using large refinement steps with a bot- 
tom up search can be an effective inference method. As discussed above, bottom 
up search suffers from two aspects: subsumption tests are more costly than in 
top down approaches, and overfitting may occur in small datasets with large ex- 
amples. On the other hand, it is not clear how large refinement steps or insights 
gained by using LGG can be used in a top down system. One interesting idea
in this direction is given in the system of [1]. Here repeated pairing-like
operations are performed without evaluating the accuracy until a syntactic condition
is met (this is specialized for the challenge problems of [11]) to produce a short 
clause. This clause is then used as a seed for a small step refinement search that 
evaluates clauses as usual. Finding similar ideas that work without using special 
properties of the domain is an interesting direction for future work. 
Abstract. This paper focuses on inductive learning of recursive logical
theories from a set of examples. This is a complex task where the learning of 
one predicate definition should be interleaved with the learning of the other 
ones in order to discover predicate dependencies. To overcome this problem 
we propose a variant of the separate-and-conquer strategy based on parallel 
learning of different predicate definitions. In order to improve its efficiency, 
optimization techniques are investigated and adopted solutions are described. 
In particular, two caching strategies have been implemented and tested on 
document processing datasets. Experimental results are discussed and 
conclusions are drawn. 



1 Introduction 

Learning a single predicate definition from a set of positive and negative examples is 
a classical problem in ILP. In this paper we are interested in the more complex case
of learning multiple predicate definitions, provided that both positive and negative 
examples of each concept/predicate to be learned are available. Complexity stems 
from the fact that the learned predicates may also occur in the antecedents of the 
learned clauses, that is, the learned predicate definitions may be interrelated and 
depend on one another, either hierarchically or involving some kind of mutual 
recursion. For instance, to learn the definitions of odd and even numbers, a multiple 
predicate learning system will be provided with positive and negative examples of 
both odd and even numbers, and may generate the following recursive logical 
theory: 

odd(X) ← succ(Y,X), even(Y)
even(X) ← succ(Y,X), odd(Y)
even(X) ← zero(X)

where the definitions of odd and even are interdependent. This example shows that the 
problem of learning multiple predicate definitions is equivalent, in its most general 
formulation, to the problem of learning recursive logical theories. 

There has been considerable debate on the actual usefulness of learning recursive 
logical theories in knowledge acquisition and discovery applications. It is a common 
opinion that very few real life concepts seem to have recursive definitions, rare 
examples being “ancestor” and natural language [4, 14]. Despite this scepticism, in
the literature it is possible to find several ILP applications in which recursion has
proved helpful [10]. Moreover, many ILP researchers have shown some interest in
multiple predicate learning [9], which presents the same difficulties as recursive
theory learning in its most general formulation.

To formulate the recursive theory learning problem and then to explain its main 
theoretical issues, some basic definitions are given below. 

Generally, every logical theory T can be associated with a directed graph
$\gamma(T) = \langle N, E \rangle$, called the dependency graph of T, in which (i) each predicate of T is a
node in N and (ii) there is an arc in E directed from a node a to a node b, iff there
exists a clause C in T, such that a and b are the predicates of a literal occurring in the
head and in the body of C, respectively.

A dependency graph allows representing the predicate dependencies of T, where a 
predicate dependency is defined as follows: 

Definition 1 (predicate dependency). A predicate p depends on a predicate q in a 
theory T iff (i) there exists a clause C for p in T such that q occurs in the body of C; 
or (ii) there exists a clause C for p in T with some predicate r in the body of C that 
depends on q. 

Definition 2 (recursive theory). A logical theory T is recursive if the dependency
graph $\gamma(T)$ contains at least one cycle.

In simple recursive theories all cycles in the dependency graph go from a 
predicate p into p itself, that is, simple recursive theories may contain recursive 
clauses, but cannot express mutual recursion. 
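
Definitions 1 and 2 translate directly into a reachability check on the dependency graph. The following Python sketch is an illustration only: a theory is abstracted, by assumption, to a list of (head predicate, body predicates) pairs, and the function names are hypothetical.

```python
from collections import defaultdict

def dependency_graph(theory):
    """theory: list of (head_predicate, [body_predicates]) pairs."""
    graph = defaultdict(set)
    for head, body in theory:
        graph[head].update(body)          # arc from head predicate to body predicate
        for q in body:
            graph.setdefault(q, set())    # make sure every predicate is a node
    return dict(graph)

def depends_on(graph, p, q):
    """Definition 1: q is reachable from p by a path of length >= 1."""
    seen, stack = set(), list(graph.get(p, ()))
    while stack:
        r = stack.pop()
        if r == q:
            return True
        if r not in seen:
            seen.add(r)
            stack.extend(graph.get(r, ()))
    return False

def is_recursive(graph):
    """Definition 2: the dependency graph contains at least one cycle."""
    return any(depends_on(graph, p, p) for p in graph)

def is_simple_recursive(graph):
    """Recursive, and no two distinct predicates depend on each other."""
    mutual = any(p != q and depends_on(graph, p, q) and depends_on(graph, q, p)
                 for p in graph for q in graph)
    return is_recursive(graph) and not mutual

# The odd/even theory from the introduction, and a simple recursive one.
ODD_EVEN = [("odd", ["succ", "even"]), ("even", ["succ", "odd"]), ("even", ["zero"])]
ANCESTOR = [("ancestor", ["parent"]), ("ancestor", ["parent", "ancestor"])]
```

On these inputs, the odd/even theory is recursive but not simple recursive (odd and even depend on each other), while the ancestor theory is simple recursive.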

Definition 3 (predicate definition). Let T be a logical theory and p a predicate
symbol. Then the definition of p in T is the set of clauses in T that have p in their
head. Henceforth, $\delta(T)$ will denote the set of predicates defined in T and $k(T)$
the set of predicates occurring in T; thus $\delta(T) \subseteq k(T)$.

In a quite general formulation, the recursive theory learning task can be defined as 
follows: 

Given 

• A set of target predicates $p_1, p_2, \ldots, p_r$ to be learned

• A set of positive (negative) examples $E_i^+$ ($E_i^-$) for each predicate $p_i$,
$1 \le i \le r$

• A background theory BK

• A language of hypotheses $L_H$ that defines the space of hypotheses $S_H$

Find

a (possibly recursive) logical theory $T \in S_H$ defining the predicates $p_1, p_2, \ldots, p_r$
(that is, $\delta(T) = \{p_1, p_2, \ldots, p_r\}$) such that for each $i$, $1 \le i \le r$, $BK \cup T \models E_i^+$ (completeness
property) and $BK \cup T \not\models E_i^-$ (consistency property).

Three important issues characterize recursive theory learning. First, the generality 
order typically used in ILP, namely θ-subsumption [17], is not sufficient to guarantee
the completeness and consistency of learned definitions, with respect to logical 
entailment [16]. Therefore, it is necessary to consider a stronger generality order, 
which is consistent with the logical entailment for the class of recursive logical 
theories we take into account. 
Second, whenever two individual clauses are consistent in the data, their conjunction
need not be consistent in the same data [8]. This is called the non-monotonicity
property of the normal ILP setting, since it states that adding new clauses to a theory 
T does not preserve consistency. Indeed, adding definite clauses to a definite 
program enlarges its least Herbrand model (LHM), which may then cover negative 
examples as well. Because of this non-monotonicity property, learning a recursive 
theory one clause at a time is not straightforward. 

Third, when multiple predicate definitions have to be learned, it is crucial to 
discover dependencies between predicates. Therefore, the classical learning strategy 
that focuses on a predicate definition at a time is not appropriate. 

To overcome these problems some solutions have been proposed in [12] and
implemented in the learning system ATRE (www.di.uniba.it/~malerba/software/atre).
This approach differs from related works in at least one of the following three
aspects: the learning strategy, the generalization model, and the strategy to recover
the consistency property of the learned theory when a new clause is added.

In this paper we focus on the main problem of the interleaving of the learning of
one (possibly recursive) predicate definition with the learning of the other ones. In
particular, different aspects of the adopted strategy for the automated discovery of 
predicate dependencies, namely the separate-and-parallel-conquer strategy, are 
presented. Efficiency problems due to the computational complexity of the search 
space are also discussed and some solutions implemented in a new version of the 
system ATRE are described. 

The paper is organized as follows. Section 2 illustrates details on the learning 
strategy. Section 3 introduces efficiency problems and related works. Section 4 
presents optimization approaches adopted in ATRE. The application of ATRE on 
real-world documents and results on efficiency gain are reported in Section 5. 
Finally, some conclusions are drawn. 



2 The Learning Strategy 

2.1 The Separate-and-Parallel-Conquer Search 

The high-level learning algorithm in ATRE belongs to the family of sequential 
covering (or separate-and-conquer) algorithms [13] since it is based on the strategy 
of learning one clause at a time, removing the covered examples and iterating the 
process on the remaining examples. Indeed, a recursive theory T is built step by step, 
starting from an empty theory $T_0$, and adding a new clause at each step. In this way
we get a sequence of theories

$T_0 = \emptyset, T_1, \ldots, T_i, T_{i+1}, \ldots, T_n = T$

such that $T_{i+1} = T_i \cup \{C\}$ for some clause C. If we denote by $LHM(T_i)$ the least
Herbrand model of a theory $T_i$, the stepwise construction of theories entails that
$LHM(T_i) \subseteq LHM(T_{i+1})$, for each $i \in \{0, 1, \ldots, n-1\}$, since the addition of a clause to a
theory can only augment the LHM. Henceforth, we will assume that both positive
and negative examples of predicates to be learned are represented as ground atoms
with a + or - label. Therefore, examples may or may not be elements of the models
$LHM(T_i)$. Let $pos(LHM(T_i))$ and $neg(LHM(T_i))$ be the number of positive and
negative examples in $LHM(T_i)$, respectively. If we guarantee the following two
conditions:

1. $pos(LHM(T_i)) < pos(LHM(T_{i+1}))$ for each $i \in \{0, 1, \ldots, n-1\}$, and

2. $neg(LHM(T_i)) = 0$ for each $i \in \{0, 1, \ldots, n\}$,

then after a finite number of steps a theory T, which is complete and consistent, is
built.
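
The two conditions can be checked mechanically on the odd/even example from the introduction. The Python sketch below is illustrative only and is not ATRE's procedure: it assumes function-free definite clauses whose head variables all occur in the body, represents atoms as tuples with capitalised strings as variables, and computes the least Herbrand model by naive forward chaining. All names (`lhm`, `pos`, `neg`, `BK`, `STEPS`, `BAD`) are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def extend(subst, atom, fact):
    """Extend `subst` so that `atom` maps onto `fact`; None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    new = dict(subst)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if new.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return new

def lhm(clauses, facts):
    """Least Herbrand model of facts plus clauses, by naive forward chaining."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for fact in model
                          if (s2 := extend(s, atom, fact)) is not None]
            for s in substs:
                derived = (head[0],) + tuple(s.get(t, t) for t in head[1:])
                if derived not in model:
                    model.add(derived)
                    changed = True
    return model

# Background knowledge and labelled examples over the numbers 0..4.
BK = {("zero", 0)} | {("succ", i, i + 1) for i in range(4)}
POS = {("even", 0), ("even", 2), ("even", 4), ("odd", 1), ("odd", 3)}
NEG = {("even", 1), ("even", 3), ("odd", 0), ("odd", 2), ("odd", 4)}

def pos(model):
    return len(model & POS)

def neg(model):
    return len(model & NEG)

C1 = (("even", "X"), [("zero", "X")])
C2 = (("odd", "X"), [("succ", "Y", "X"), ("even", "Y")])
C3 = (("even", "X"), [("succ", "Y", "X"), ("odd", "Y")])
STEPS = [[], [C1], [C1, C2], [C1, C2, C3]]        # T_0, T_1, T_2, T_3
BAD = (("odd", "X"), [("succ", "Y", "X")])        # over-general clause
```

Along `STEPS` the number of covered positive examples grows strictly while no negative example enters the model, whereas adding `BAD` to `[C1]` brings negative examples into the least Herbrand model and therefore violates the second condition.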

In order to guarantee the first of the two conditions it is possible to proceed as
follows. First, a positive example $e^+$ of a predicate p to be learned is selected, such
that $e^+$ is not in $LHM(T_i)$. The example $e^+$ is called seed. Then the space of definite
clauses more general than $e^+$ is explored, looking for a clause C, if any, such that
$neg(LHM(T_i \cup \{C\})) = 0$. In this way we guarantee that the second condition above
holds as well. When found, C is added to $T_i$ giving $T_{i+1}$. If some positive examples are
not included in $LHM(T_{i+1})$ then a new seed is selected and the process is repeated.

The second condition is more difficult to guarantee because of the second issue 
presented in the introduction, namely, the non-monotonicity property. The approach 
followed in ATRE to remove inconsistency due to the addition of a clause to the 
theory consists of simple syntactic changes in the theory, which eventually creates 
new layers, just as the stratification of a normal program creates new strata [1].
Details on the layering approach and on the computation method are reported in [12].
The layering of a theory introduces a first variation of the classical separate-and-
conquer strategy sketched above, since the addition of a locally consistent clause 
generated in the conquer stage is preceded by a global consistency check. 

As explained above, in recursive theory learning it is necessary to consider a 
generality order that is consistent with the logical entailment for the class of 
recursive logical theories. The main problem with the well-known θ-subsumption is
that the objects of comparison are two clauses and no additional source of knowledge 
(e.g., a theory T) is considered. Instead, we are only interested in those relative 
generality orders that compare two clauses relatively to a given theory T. In ATRE, a 
new generalization order named generalized implication is adopted [12], since both 
Buntine's generalized subsumption [5] and Plotkin’s [17,18] notion of relative 
generalization are not appropriate (they are either too strong or too weak). 

A solution to the problem of automated discovery of dependencies between target
predicates $p_1, p_2, \ldots, p_r$ is based on another variant of the separate-and-conquer
learning strategy. Traditionally, this strategy is adopted by single predicate learning
systems that generate clauses with the same predicate in the head at each step. In
multiple/recursive predicate learning, clauses generated at each step may have
different predicates in their heads. In addition, the body of the clause generated at the
i-th step may include all target predicates $p_1, p_2, \ldots, p_r$ for which at least one clause has
been added to the partially learned theory in previous steps. In this way,
dependencies between target predicates can be generated.

Obviously, the order in which clauses of distinct predicate definitions have to be 
generated is not known in advance. This means that it is necessary to generate 
clauses with different predicates in the head and then to pick one of them at the end 
of each step of the separate-and-conquer strategy. Since the generation of a clause
depends on the chosen seed, several seeds have to be chosen such that at least one 
seed per incomplete predicate definition is kept. Therefore, the search space is 
actually a forest of as many search-trees (called specialization hierarchies ) as the 
number of chosen seeds. A directed arc from a node C to a node C' exists if C' is 
obtained from C by a single refinement step. Operatively, the (downward) 
refinement operator considered in this work adds a new literal to a clause.¹



Upper panel (seeds: even(0), odd(1))

  Search-tree for seed even(0)
    Level 0:  even(X) ←
    Level 1:  even(X) ← zero(X)                 even(X) ← succ(X,Y)
    Level 2:  even(X) ← zero(X), succ(X,Y)      even(X) ← succ(X,Y), succ(Y,Z)

  Search-tree for seed odd(1)
    Level 0:  odd(X) ←
    Level 1:  odd(X) ← succ(Y,X)                odd(X) ← succ(X,Y)
    Level 2:  odd(X) ← succ(Y,X), zero(Y)       odd(X) ← succ(Y,X), succ(X,Z)

Lower panel (seeds: odd(1), even(2))

  Search-tree for seed odd(1)
    Level 0:  odd(X) ←
    Level 1:  odd(X) ← succ(Y,X)                odd(X) ← succ(X,Y)
    Level 2:  odd(X) ← succ(Y,X), zero(Y)       odd(X) ← succ(Y,X), even(Y)

  Search-tree for seed even(2)
    Level 0:  even(X) ←
    Level 1:  even(X) ← succ(Y,X)               even(X) ← succ(X,Y)
    Level 2:  even(X) ← succ(Y,X), succ(Z,Y)    even(X) ← succ(X,Y), succ(Y,Z)

Fig. 1. Two steps (up and down) of the parallel search for the predicates odd and even.
Consistent clauses are reported in italics in the original figure.

The forest can be processed in parallel by as many concurrent tasks as the number 
of search-trees (hence the name of separate-and-parallel-conquer for this search 
strategy). Each task traverses the specialization hierarchy top-down (or general-to- 
specific), but synchronizes traversal with the other tasks at each level. Initially, some 
clauses at depth one in the forest are examined concurrently. Each task is actually 
free to adopt its own search strategy, and to decide which clauses are worth
testing. If none of the tested clauses is consistent, clauses at depth two are considered.
Search proceeds towards deeper and deeper levels of the specialization hierarchies
until at least a user-defined number of consistent clauses is found. Task
synchronization is performed after all “relevant” clauses at the same depth have
been examined. A supervisor task decides whether the search should carry on or not 
on the basis of the results returned by the concurrent tasks. When the search is 
stopped, the supervisor selects the “best” consistent clause according to the user’s 
preference criterion. This strategy has the advantage that simpler consistent clauses 



¹ A discussion on properties of this operator is beyond the scope of this paper. A thorough
description of upward and downward refinement operators can be found in [16].




